Tushar Jamdade - GSoC 2025

Project Information

Name	Tushar Jamdade
Email	tusharnjamdade@gmail.com
Organization	Data for the Common Good (D4CG)
Mentor	Steve Krasinsky
Project Repo	pcdc-analysis-notebook
Contributors	GSoC Contributors at D4CG
Technologies	R, Docker, Apache Spark, RStudio, Shiny

Overview

Dockerized R/Shiny Application for PFB AVRO Biomedical Data Analysis

This project provides a Dockerized, RStudio-based Shiny application for analyzing biomedical data stored in AVRO files that follow the Portable Format for Biomedical Data (PFB) standard. It uses Apache Spark to read and process large AVRO files, flattening their nested, hierarchical records into tidy tables for downstream analysis.

Users can explore entities and observation nodes, generate percentage distributions, and run Kaplan–Meier survival analyses. A configurable Table-1 feature lets users select covariates and, when grouping is applied, adds statistical comparisons between the groups. Outputs can be downloaded as Excel or PNG files for reporting.

Workflow

Users upload a file in the PFB AVRO format. A valid file exposes its available entities; an invalid file produces an error. The core entities person, subject, timing, and survival_characteristics are included by default because survival analysis (OS and EFS curves) requires them. Users may also select additional entities.

Example: nested AVRO structure before flattening.
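The upload check and default entity selection described in the workflow can be illustrated with a small sketch. The project itself is written in R; the Python below is only an illustration, and the names `DEFAULT_ENTITIES`, `looks_like_avro`, and `entities_to_load` are hypothetical. The one fixed fact it relies on is that an Avro object container file begins with the four bytes `Obj` followed by `0x01`. Passing this check does not prove a file is valid PFB, only that it is an Avro container.

```python
AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file

# Hypothetical constant: the entities the text says are included by default.
DEFAULT_ENTITIES = ("person", "subject", "timing", "survival_characteristics")


def looks_like_avro(header: bytes) -> bool:
    """Return True if the given bytes start with the Avro container magic."""
    return header[:4] == AVRO_MAGIC


def entities_to_load(available, selected=()):
    """Return default entities plus user selections, in order, without duplicates.

    Raises ValueError if the file does not expose a required default entity.
    Selections that the file does not expose are silently dropped (an assumption).
    """
    missing = [e for e in DEFAULT_ENTITIES if e not in available]
    if missing:
        raise ValueError(f"file lacks required entities: {missing}")
    wanted = dict.fromkeys(list(DEFAULT_ENTITIES) + list(selected))
    return [e for e in wanted if e in available]
```

In this sketch the defaults always come first, so a user selection can add entities but never remove the ones survival analysis depends on.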

The application flattens nested records into a tidy table, preserves relationships, and merges entity-level information to create a single dataset that can be further used for analysis.

Example: flattened and merged data for further processing.
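As a rough illustration of the flatten-and-merge step, the sketch below works on plain Python dictionaries rather than Spark data frames. It assumes each record carries `id`, `name`, `object`, and `relations` (a list of `dst_id`/`dst_name` links), and that child records point at a `subject` record. The property names in the usage example (`consortium`, `status`, `followup`) are made up for illustration, and the app's actual R/Spark implementation may differ.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts into a single level, joining keys with '.'."""
    flat = {}
    for key, item in value.items():
        name = f"{prefix}{key}"
        if isinstance(item, dict):
            flat.update(flatten(item, prefix=f"{name}."))
        else:
            flat[name] = item
    return flat


def merge_on_subject(records):
    """Return one row per child record, joined to the subject it relates to.

    Column names are prefixed with the entity name so that the origin of
    every value stays visible after the merge.
    """
    subjects = {
        r["id"]: flatten(r["object"], "subject.")
        for r in records
        if r["name"] == "subject"
    }
    rows = []
    for r in records:
        if r["name"] == "subject":
            continue
        for rel in r["relations"]:
            if rel["dst_name"] == "subject" and rel["dst_id"] in subjects:
                row = {"subject_id": rel["dst_id"]}
                row.update(subjects[rel["dst_id"]])
                row.update(flatten(r["object"], f"{r['name']}."))
                rows.append(row)
    return rows
```

For example, a `survival_characteristics` record with `{"status": "Alive", "followup": {"days": 400}}` linked to subject `s1` becomes a single row with the columns `subject_id`, `survival_characteristics.status`, and `survival_characteristics.followup.days`, alongside the subject's own columns.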

Percentage Distribution

After flattening, all categorical variables are displayed in a dropdown. Users can select variables to compute percentage distributions and download the results as Excel or PNG.

Example: distribution visualization (exportable).
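The percentage distribution of one categorical variable amounts to counting each level and dividing by the total. A minimal Python sketch follows; the function name is hypothetical, and treating missing values as their own "Missing" category is an assumption, not something the source states.

```python
from collections import Counter


def percentage_distribution(values):
    """Return {category: percent of all values}, rounded to one decimal.

    None is reported under the label "Missing" (an assumption of this sketch).
    """
    labels = ["Missing" if v is None else v for v in values]
    counts = Counter(labels)
    total = len(labels)
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}
```

The resulting mapping is what a bar chart or an exported spreadsheet row would be built from.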

Survival Curves

Overall Survival

Overall survival (OS) is estimated from the diagnosis and last-known survival status information in survival_characteristics. The Kaplan–Meier estimator is used to produce survival plots with risk tables, which can be downloaded as PNG files.
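For readers unfamiliar with the Kaplan–Meier estimator, the sketch below computes it from scratch in Python. It is a textbook product-limit calculation, not the app's code (which is in R and most likely relies on an existing survival library). The inputs are assumed to be a follow-up time and an event flag per subject.

```python
def kaplan_meier(times, events):
    """Return [(time, n_at_risk, n_events, survival)] for each distinct event time.

    times  -- follow-up time for each subject
    events -- 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - n_events / n_at_risk
            curve.append((t, n_at_risk, n_events, survival))
        n_at_risk -= n_leaving
    return curve
```

The `n_at_risk` column in the returned tuples is the same quantity a risk table prints beneath the curve. Subjects censored at time t are still counted as at risk at t, which is the usual convention.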

Event-Free Survival

Event-free survival (EFS) curves are computed from the censoring status and event timing information in subject and timing.

Table-1 Analysis

Users select covariates and their variable types to build Table 1. Numeric variables are summarized by mean, median, and percentage of missing values; categorical variables by counts, percentages, and missing values. When a subject ID grouping file is uploaded, subjects are split into two groups (IDs in the file vs. all remaining subjects), and p-values are computed where appropriate. Users can clear the result and apply a different grouping, or produce the table for a single group. Survival analysis can also be run on the uploaded subject IDs.
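The per-variable summaries in Table 1 can be sketched as follows. This is an illustrative Python version with hypothetical function names. Computing categorical percentages over non-missing values only is an assumption, since the source does not say which denominator the app uses.

```python
from collections import Counter
from statistics import mean, median


def summarize_numeric(values):
    """Mean, median, and percentage of missing (None) values."""
    present = [v for v in values if v is not None]
    return {
        "mean": mean(present) if present else None,
        "median": median(present) if present else None,
        "missing_pct": (
            round(100 * (len(values) - len(present)) / len(values), 1)
            if values
            else None
        ),
    }


def summarize_categorical(values):
    """Count and percentage per level, plus the number of missing values.

    Percentages use the non-missing count as denominator (an assumption).
    """
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "levels": {
            lvl: (n, round(100 * n / len(present), 1)) for lvl, n in counts.items()
        },
        "missing": len(values) - len(present),
    }
```

Each selected covariate would contribute one such summary per group, giving the rows of the table.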

Interactive Table-1 output (example).

Interactive Table-1 output (single grouping).

Interactive Table-1 output (grouping + p-value).
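The two-group comparison can be illustrated with a short Python sketch: subjects are split by whether their ID appears in the uploaded list, and a binary covariate is compared between the groups. The source does not state which tests the app uses, so Pearson's chi-square test without continuity correction on a 2×2 table is used here purely as an example. The function names and the `subject_id` key are hypothetical.

```python
import math


def split_by_ids(rows, uploaded_ids, id_key="subject_id"):
    """Split rows into (rows whose ID is in the uploaded list, all remaining rows)."""
    ids = set(uploaded_ids)
    in_file = [r for r in rows if r[id_key] in ids]
    remaining = [r for r in rows if r[id_key] not in ids]
    return in_file, remaining


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value). With one degree of freedom the upper tail
    of the chi-square distribution equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / denom
    return stat, math.erfc(math.sqrt(stat / 2))
```

Here `a` and `b` would be the counts of the two covariate levels among the uploaded IDs, and `c` and `d` the same counts among the remaining subjects. Covariates with more than two levels, or numeric covariates, need a different test than this one.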