Phase 1: Data overview
In this phase, we will import the dataset downloaded in the 'Introduction' and examine its structure ahead of downstream analysis.
The aim is to assess the different data types within the dataset, view their distributions for preliminary exploration, and inspect missing values, which are especially common in longitudinal studies.
1) Launch PANDORA
Step 1. Open Docker and start the PANDORA container

Step 2. Open Your Terminal
On Windows, search for PowerShell in your Start menu and open it.
On macOS or Linux, open the Terminal app.
Step 3. Run the Installation command
docker run --rm --detach --name genular --tty --interactive --env IS_DOCKER='true' --env TZ=Europe/London --oom-kill-disable --volume genular_frontend_latest:/var/www/genular/pandora --volume genular_backend_latest:/var/www/genular/pandora-backend --volume genular_data_latest:/mnt/usrdata --publish 3010:3010 --publish 3011:3011 --publish 3012:3012 --publish 3013:3013 genular/pandora:latest
Step 4. Access PANDORA
Open your browser and navigate to http://localhost:3010
2) Data upload and plot
The following steps will walk you through uploading your dataset and creating plots to examine your data structure.
Step 1. Navigate to Workspace
Step 2. Upload the covid_pitch.csv file into Workspace
Step 3. Select the uploaded dataset

Step 4. Navigate to Data Overview
Click Discovery -> Start -> Data overview

Step 5. Select and examine dataset variables
Select your sorting variable first, which is Timepoint in this example.
Select columns Age, Days pso, S-IgG, S-IgG memB SARS-CoV-2, Disease severity, and Responder.
These columns are representative of key parameters to consider for answering our immunological questions proposed in the dataset description.
Check for missing values during column selection.
The number of missing values (NAs) is shown to the right of the column name.
A star (*) indicates that a column is missing 10% or more of its values.
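The missing-value counts and the 10% star rule described above can be reproduced outside PANDORA as a quick sanity check. The following is a minimal sketch in Python, not PANDORA's own implementation: the inline sample rows are invented for illustration, and the set of strings treated as missing (`NA_TOKENS`) and the helper name `missing_summary` are assumptions.

```python
import csv
import io

# Hypothetical miniature of covid_pitch.csv; the values are invented for illustration.
SAMPLE_CSV = """Timepoint,Age,S-IgG,Responder
1,34,2.1,high
1,51,,low
2,45,3.0,
3,29,2.7,
3,62,3.3,high
"""

# Assumed spellings of a missing value; adjust to match your file.
NA_TOKENS = {"", "NA", "NaN"}


def missing_summary(rows, threshold=0.10):
    """Return {column: (na_count, flagged)}; flagged means >= threshold of rows are missing."""
    rows = list(rows)
    if not rows:
        return {}
    summary = {}
    for column in rows[0].keys():
        na_count = sum(
            1 for row in rows if (row[column] or "").strip() in NA_TOKENS
        )
        summary[column] = (na_count, na_count / len(rows) >= threshold)
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = missing_summary(rows)
for column, (na_count, flagged) in summary.items():
    # Mirror the column list: NA count next to the name, star when 10% or more is missing.
    print(f"{column}{' *' if flagged else ''}: {na_count} NA")
```

To run this against the real file, replace `io.StringIO(SAMPLE_CSV)` with an opened `covid_pitch.csv`.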

Guidelines for Column Selection
How to select your sorting variable:
This is typically the independent variable central to your immunological question, against which all other variables are compared.
In this workflow example, the immunological questions (listed below) both focus on time as an independent variable. Hence, Timepoint was used as the sorting variable.
Are there certain immune parameters that can explain the disease severity experienced by individuals and that are dependent on time post SARS-CoV-2 infection?
Can we utilize certain immune parameters measured early after infection to predict whether an individual builds a durable immune response to SARS-CoV-2?
How to select columns for examination:
Consider columns from categories of variables that are essential to answering your proposed immunological question.
In this example, as described in the data overview, these are columns in the categories of:
Clinical symptoms, immunological parameters, responder status, demographics, and time
PANDORA allows you to select up to a dozen columns at a time for examination, so you may need to generate multiple plots.
Handling Missing Values
Take care when using median imputation for features containing more than 10% missing values (NA). In these cases, check the dataset to make sure the missing values are not biased (for example, all severe cases missing a particular timepoint measurement).
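One way to look for this kind of bias is to compare the share of missing values across the levels of a categorical column. The sketch below is illustrative only: the records are invented, `None` is assumed to mark a missing value, and `missing_rate_by_group` is a hypothetical helper rather than a PANDORA function.

```python
from collections import defaultdict

# Hypothetical records; None marks a missing Responder value. Values are invented.
records = [
    {"Disease severity": "asymptomatic", "Responder": "high"},
    {"Disease severity": "asymptomatic", "Responder": "low"},
    {"Disease severity": "asymptomatic", "Responder": None},
    {"Disease severity": "asymptomatic", "Responder": "high"},
    {"Disease severity": "mild", "Responder": None},
    {"Disease severity": "mild", "Responder": None},
    {"Disease severity": "mild", "Responder": "low"},
    {"Disease severity": "mild", "Responder": None},
]


def missing_rate_by_group(records, group_col, value_col):
    """Return {group: fraction of rows in that group where value_col is missing}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for record in records:
        group = record[group_col]
        totals[group] += 1
        if record[value_col] is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


rates = missing_rate_by_group(records, "Disease severity", "Responder")
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} of Responder values missing")
```

A large gap between groups, as in this invented sample, suggests the values are not missing at random, in which case median imputation could distort comparisons between those groups.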
Step 6. Select 'Plot Image' to generate distribution and table plots
You will see plots similar to the ones below:


Repeat the above steps to produce plots for key categorical variables and numerical assays of interest.
3) Examine distribution plots
The distribution plot displays the frequency and spread of individual variables, hence providing information about skewness, potential outliers, and correlations between variables. Here, we will provide an example of interpreting results from the distribution plot generated in the previous steps.
For a more comprehensive overview of distribution plots, along with how to analyze the ones produced in PANDORA, visit the Understanding distribution plots page.

Timepoint, Disease severity, Responder, S-IgG, and S-IgG memB SARS-CoV-2
Based on the distribution plot generated in our example and shown above:
The chosen variables consist of both categorical and continuous variables, as indicated by the presence of both graphs and histograms along the diagonal.
Timepoint is a continuous variable with a multimodal distribution.
Disease severity is a categorical variable shown with two bins, with the first class (asymptomatic) notably less present in the dataset compared to the second class (mild).
Responder is a categorical variable with approximately equal numbers of 'low' and 'high' responders.
S-IgG and S-IgG memB SARS-CoV-2 are continuous variables with right-skewed distributions.
There is significant correlation (indicated by the stars next to the correlation values and highlighted by the red boxes) between:
Timepoint and S-IgG
S-IgG memB SARS-CoV-2 and S-IgG
The box plots do not portray any notable relationships.
The histogram with Responder and Disease severity (shown in green box) shows the following:
Lower numbers of asymptomatic workers who were 'high' responders compared to workers with mild severity.
A slightly higher number of mild workers who were 'low' responders compared to asymptomatic.
There are outliers in the correlation plots, as shown by the blue circles.
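The correlation values and the direction of skew read off the plot can be cross-checked numerically. Below is a minimal sketch using the Pearson correlation coefficient and the moment-based skewness coefficient. It assumes the plot reports Pearson correlations, which this page does not state, and all numbers are invented. The significance stars come from a statistical test that is not reproduced here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def skewness(xs):
    """Moment-based skewness; positive values indicate a right-skewed distribution."""
    n = len(xs)
    mean_x = sum(xs) / n
    m2 = sum((x - mean_x) ** 2 for x in xs) / n
    m3 = sum((x - mean_x) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Invented values standing in for two numeric columns.
timepoint = [1, 1, 2, 2, 3, 3]
s_igg = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
r = pearson_r(timepoint, s_igg)

# Invented values with one large observation, giving a long right tail.
titres = [1, 1, 2, 2, 3, 12]
g1 = skewness(titres)

print(f"Pearson r = {r:.2f}, skewness = {g1:.2f}")
```

The single large value in `titres` is the kind of point that appears as an outlier in the plot, and it is also what pulls the skewness positive.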
4) Examine table plots
This plot can be used to visualize distribution patterns for multiple variables together in a single figure, examine missing values, and understand data types and unique value counts.
For a more comprehensive overview of table plots, along with how to analyze the ones produced in PANDORA, visit the Understanding table plots page.
To view the table plots, select the Table Plot tab (located to the left of the Distribution Plot tab).

Based on the table plots generated from our previous example, and shown above:
The leftmost variable, Timepoint, is the sorting variable that arranges all rows from top to bottom in the order of smallest to largest values.
As stated in the bottom left corner of the graph (highlighted in a red box), there are 100 bins with 4 objects in each bin.
Disease severity and Responder variables are categorical, as shown by the legend and colored bins.
There are a notable number of missing values in the Responder column (colored in red).
When comparing the Responder plot to the Disease severity plot, missing values are more prevalent in samples taken from workers with severe disease symptoms.
Timepoint, Age, Days pso, S-IgG, and S-IgG memB SARS-CoV-2 are numerical variables.
Timepoint has a staircase-type distribution that indicates the five discrete time points at which the samples were obtained.
The distribution of Age, Days pso, S-IgG, and S-IgG memB SARS-CoV-2 portrays these variables as more continuous numerical variables.
Generally, there is no correlation between timepoints and concentration of spike protein-specific IgG produced from memory B cells.
S-IgG, which is log-transformed and represents overall spike protein-specific IgG concentration, appears to be higher at the later time points compared to the earliest time points.
The samples with the highest IgG concentrations in either variable generally correspond with a high responder.
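The sort-then-bin idea behind the table plot can be sketched in a few lines of Python. This illustrates the general technique rather than PANDORA's implementation: it assumes numeric columns are summarized by a per-bin mean, uses 400 synthetic rows (so that 100 bins hold 4 rows each), and `table_plot_bins` is a hypothetical helper.

```python
from statistics import mean


def table_plot_bins(rows, sort_key, value_key, n_bins):
    """Sort rows by sort_key, split them into n_bins consecutive bins,
    and return the mean of value_key in each bin (None for an empty bin)."""
    ordered = sorted(rows, key=lambda row: row[sort_key])
    size, remainder = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # Spread any leftover rows over the first `remainder` bins.
        end = start + size + (1 if i < remainder else 0)
        chunk = ordered[start:end]
        bins.append(mean(row[value_key] for row in chunk) if chunk else None)
        start = end
    return bins


# Synthetic data: five discrete timepoints with 80 rows each (400 rows in total).
rows = [{"Timepoint": float(t)} for t in range(1, 6) for _ in range(80)]

# Bin the sorting variable itself: 100 bins of 4 rows each.
bins = table_plot_bins(rows, "Timepoint", "Timepoint", n_bins=100)
print(sorted(set(bins)))  # only five distinct bin means, i.e. the staircase pattern
```

Because the sorting variable takes only five values in this synthetic data, its binned column steps up in five flat segments, which is the staircase shape described for Timepoint above.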
You've now uploaded your dataset and completed an initial inspection to understand variable types, distributions, and missing values. These initial steps ensure your data is clean and well understood before performing more comprehensive analyses and running predictive models.