Phase 1: Data overview

Purpose

The purpose of this phase is to examine the structure of the dataset. Specifically, the aim is to assess the data types within the dataset, view their distributions for preliminary exploration, and identify missing values, which are especially common in longitudinal studies.

Action

1) Launch PANDORA

Open Docker and start the PANDORA container

Open your terminal:
- On Windows, search for PowerShell in your Start menu and open it.
- On macOS or Linux, open the Terminal app.

Run the installation command:

docker run --rm --detach --name genular --tty --interactive --env IS_DOCKER='true' --env TZ=Europe/London --oom-kill-disable --volume genular_frontend_latest:/var/www/genular/pandora --volume genular_backend_latest:/var/www/genular/pandora-backend --volume genular_data_latest:/mnt/usrdata --publish 3010:3010 --publish 3011:3011 --publish 3012:3012 --publish 3013:3013 genular/pandora:latest

Access PANDORA:
- Open your browser and navigate to http://localhost:3010
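
If the page does not load, it can help to check whether the published ports are accepting connections. The following is a minimal sketch using only the Python standard library. The helper name port_is_open is hypothetical, and the port numbers are taken from the --publish flags in the command above. An open port only shows that something is listening, not that PANDORA has finished starting.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


# Ports published by the docker run command (3010 serves the browser interface).
for port in (3010, 3011, 3012, 3013):
    status = "reachable" if port_is_open("localhost", port) else "not reachable"
    print(f"localhost:{port} is {status}")
```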

2) Data Upload

Navigate to Workspace
Upload the covid_pitch.csv file onto the Workspace
Select this dataset to start exploring and analyzing!

3) Data Overview for Initial Exploration

Navigate to Data Overview by going to Discovery -> Start -> Data overview

Select up to 5 variables for visualizing data distributions
1. The first variable will be selected as the sorting variable
2. Identify key columns: In this study, the data can be divided into several important categories: Donor ID, Timepoints, Immunological assays, Demographics, Clinical symptoms, Disease severity, and Responders. Aim to view distributions that are representative of these categories.
3. Missing values (NA): The number of missing values in each feature is shown during column selection. A star next to the number of NAs indicates that fewer than 10% of the values in that feature are NA.
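
The per-feature NA counts described above can also be checked outside PANDORA before upload. The sketch below is an illustration under stated assumptions, not PANDORA's own implementation: the helper missing_summary is hypothetical, the toy column names stand in for the real covid_pitch.csv columns, and treating empty cells, "NA", and "NaN" as missing is an assumption about the file's encoding. It marks columns below the 10% threshold with a star, mirroring the description above.

```python
import csv
import io

# Assumption: these tokens mark a missing value in the CSV file.
MISSING_TOKENS = {"", "NA", "NaN"}


def missing_summary(rows, threshold=0.10):
    """Return {column: (n_missing, fraction_missing, below_threshold)}."""
    rows = list(rows)
    if not rows:
        return {}
    summary = {}
    for column in rows[0].keys():
        n_missing = sum(
            1 for row in rows if (row[column] or "").strip() in MISSING_TOKENS
        )
        fraction = n_missing / len(rows)
        summary[column] = (n_missing, fraction, fraction < threshold)
    return summary


# Toy data standing in for covid_pitch.csv; the column names are hypothetical.
toy_csv = """donor_id,timepoint,assay_value,sex
D1,1,0.5,F
D2,1,NA,M
D3,1,1.2,
D4,1,NA,F
"""

rows = csv.DictReader(io.StringIO(toy_csv))
for column, (n_missing, fraction, below) in missing_summary(rows).items():
    star = "*" if below else ""
    print(f"{column}: {n_missing} NA ({fraction:.0%}) {star}")
```

To run this against the real file, replace io.StringIO(toy_csv) with an open file handle for covid_pitch.csv.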

After selecting the desired features, click 'Plot image' to generate the distribution and table plots for the selected columns.

As this dataset consists of both categorical and numerical features, below is an example workflow of visualizing both data types.

Assessing Data from COVID Pitch dataset by data types

As there are multiple data types within this dataset, it is valuable to view the distributions of these data types. Here, we are analyzing the distributions of the categorical variables, specifically Donor ID (manually added to the dataset), Responder, Disease Severity, Sex, and Change or loss of taste.
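
For a quick cross-check of the categorical distributions, the category frequencies can be tallied directly from the CSV. This is a minimal sketch using the Python standard library. The helper category_counts is hypothetical, and the toy column names and values are placeholders rather than the actual headers or categories in covid_pitch.csv. Empty cells are counted under the label "NA", which is an assumption.

```python
import csv
import io
from collections import Counter


def category_counts(rows, columns):
    """Count how often each category appears in the given columns."""
    counts = {column: Counter() for column in columns}
    for row in rows:
        for column in columns:
            value = (row.get(column) or "").strip()
            # Empty cells are tallied under the label "NA" (assumption).
            counts[column][value if value else "NA"] += 1
    return counts


# Toy data; the real column names and categories may differ.
toy_csv = """donor_id,responder,severity,sex
D1,yes,mild,F
D2,no,severe,M
D3,yes,mild,
D4,yes,severe,F
"""

rows = csv.DictReader(io.StringIO(toy_csv))
result = category_counts(rows, ["responder", "severity", "sex"])
for column, counter in result.items():
    print(column, dict(counter))
```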
