Phase 1: Data overview

Purpose

The purpose of this phase is to examine the structure of the dataset. Specifically, the aim is to assess the different data types within the dataset, view their distributions for preliminary exploration, and identify missing values, which are especially common in longitudinal studies.

Action

1) Launch PANDORA
  1. Open Docker and start the PANDORA container

PANDORA container on Docker Desktop
  1. Open your terminal:

    • On Windows, search for PowerShell in your Start menu and open it.

    • On macOS or Linux, open the Terminal app.

  2. Run the installation command:

    docker run --rm --detach --name genular --tty --interactive --env IS_DOCKER='true' --env TZ=Europe/London --oom-kill-disable --volume genular_frontend_latest:/var/www/genular/pandora --volume genular_backend_latest:/var/www/genular/pandora-backend --volume genular_data_latest:/mnt/usrdata --publish 3010:3010 --publish 3011:3011 --publish 3012:3012 --publish 3013:3013 genular/pandora:latest
  3. Access PANDORA:

2) Data Upload
  1. Navigate to Workspace

  2. Upload the covid_pitch.csv file to the Workspace

  3. Select this dataset to start exploring and analyzing!

Selecting COVID Pitch dataset in Workspace
3) Data Overview for Initial Exploration
  1. Navigate to Data Overview by going to Discovery -> Start -> Data overview

Steps to access 'Data overview' on PANDORA
  2. Select up to 5 variables for visualizing data distributions

    1. The first variable selected is used as the sorting variable

    2. Identify key columns: In this study, the data can be divided into several important categories: Donor ID, timepoints, immunological assays, demographics, clinical symptoms, disease severity, and responders. Aim to view distributions that are representative of these categories

    3. Missing values (NA): The number of missing values in each feature is shown during column selection. A star next to the number of NAs indicates that fewer than 10% of the values in that feature are NA

  3. After selecting the desired features, select 'Plot image' to generate the distribution and table plots for the selected columns
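
To cross-check the NA counts outside PANDORA, the per-column missing-value tally can be sketched in a few lines of Python. This is only an illustration, not PANDORA's own implementation: the inline sample data, the column names, the set of strings treated as missing, and the choice to star columns with zero NAs are all assumptions. Replace the sample with the contents of covid_pitch.csv to run it on the real dataset.

```python
import csv
import io

# Toy stand-in for covid_pitch.csv; column names and values are made up.
SAMPLE_CSV = """donor_id,timepoint,sex,antibody_titre
D01,1,F,120.5
D02,1,M,NA
D03,2,,98.0
D04,2,F,NA
"""

# Assumed spellings of a missing value; adjust to match the real file.
NA_TOKENS = {"", "NA", "NaN", "nan"}


def missing_summary(rows, columns):
    """Return {column: (na_count, na_fraction)} for a list of dict rows."""
    total = len(rows)
    summary = {}
    for col in columns:
        na_count = sum(
            1 for row in rows if (row[col] or "").strip() in NA_TOKENS
        )
        summary[col] = (na_count, na_count / total if total else 0.0)
    return summary


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
summary = missing_summary(rows, reader.fieldnames)

for col, (count, fraction) in summary.items():
    # Mimic the star marker: flag columns where fewer than 10% of values are NA.
    star = "*" if fraction < 0.10 else ""
    print(f"{col}: {count} NA ({fraction:.0%}){star}")
```

Columns with a high NA fraction are candidates to leave out of the first round of distribution plots.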

As this dataset consists of both categorical and numerical features, below is an example workflow of visualizing both data types.

Assessing Data from COVID Pitch dataset by data types

Because the dataset contains multiple data types, it is useful to view the distribution of each type. Here, we analyze the distributions of the categorical variables, specifically Donor ID (manually added to the dataset), Responder, Disease Severity, Sex, and Change or loss of taste.
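
For a quick numeric counterpart to the categorical distribution plots, the frequency of each level can be tallied with Python's standard library. The sketch below uses a small made-up sample; the column names and values are hypothetical and may not match the headers in covid_pitch.csv.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the categorical columns; names and values are made up.
SAMPLE_CSV = """donor_id,responder,disease_severity,sex
D01,yes,mild,F
D02,no,severe,M
D03,yes,mild,F
D04,yes,asymptomatic,M
"""


def category_counts(rows, column):
    """Count how often each level of a categorical column occurs."""
    return Counter(row[column] for row in rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

for column in ("responder", "disease_severity", "sex"):
    counts = category_counts(rows, column)
    # most_common() orders levels from most to least frequent.
    print(column, dict(counts.most_common()))
```

Heavily imbalanced levels (for example, very few non-responders) are worth noting at this stage, since they affect how later analyses should be interpreted.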
