BabelBetes
The BabelBetes project aims to standardize publicly available clinical trial data on continuous glucose monitoring (CGM) and insulin pump delivery, reducing the costs and time associated with data translation for researchers. Motivated by the challenges of inconsistent data formats, BabelBetes will streamline access to usable datasets, accelerating innovation in type 1 diabetes care.
Challenges with Publicly Available Clinical Trial Data
Data is the raw material from which models are developed, simulations are composed, and new therapies to reduce the burden of living with type 1 diabetes are developed.
Clinical trials performed at great time and expense, funded by Breakthrough T1D, HCT, and NIH have provided large volumes of granular data which is often stored publicly (www.jaeb.org) or otherwise readily accessible (OPEN Project, OpenAPS, Nightscout Data Commons).
Unfortunately this is often the only data available to researchers and developers seeking to provide innovative solutions for people with type 1 diabetes, putting them at great disadvantage relative to leading medical device companies who together gather more data per day than exists in the entire public domain, ever (approximately 500,000 subject-days).
To add to this, public available data is not stored with consistent methods or formats, resulting in a confusing array of file formats and data descriptors which must be translated at great effort and with high probability of error by each and every researcher or developer hoping to gain insights.
Last Mile Problem
Babelbetes addresses this “last mile” problem by developing a publicly available set of tools to normalize clinical diabetes trial datasets, focusing on continuous glucose monitoring and insulin pump delivery. Babelbetes also provides recommendations on a normalized data set format to ensure future activities provide shovel-ready data for researchers and developers.
This is the official project documentation
Supported Studies
Figure: Days worth of complete data including cgm, basal and bolus data for supported studies. Overall, we have approximately normalized and extracted half a million days of data.
The goal is to work with as many clinical diabetes trial datasets as possible. At the moment, the following datasets from the diabetes JAEB database are supported.
For each of these studies, we've spent hundreds of hours analyzing the data to ensure that the class correctly loads and extracts the data. Please refer to the study analysis pages for a summary of the analysis and findings that went into each dataset. While we operated with great care, some assumptions had to be made and other details remain unknown which are also documented.
Analysis & Documentation | Link | Supported Version/Retrieval Date | Folder Name * | Note |
---|---|---|---|---|
Flair | JAEB | -/April 17th, 2024 | FLAIRPublicDataSet.zip | ⚠️We don't support the newest version (September, 2024) where insulin pump data was removed from the dataset. |
DCLP3 | JAEB | Release 3 / 2022-08-04 | DCLP3 Public Dataset - Release 3 - 2022-08-04.zip | - |
DCLP5 | JAEB | -/April 17th, 2024 | DCLP5_Dataset_2022-01-20-5e0f3b16-c890-4ace-9e3b-531f3687cf53.zip | - |
IOBP2 | JAEB | -/April 17th, 2024 | IOBP2 RCT Public Dataset.zip | - |
PEDAP | JAEB | Release 4/2025-04-10 | PEDAP Public Dataset - Release 4 - 2025-04-10.zip | Our investigation resulted in two updated version: Release 3 (updated patient ids), Release 4 with complete basal date. |
T1DEXI | JAEB | -/October 1st, 2022 | T1DEXI - DATA FOR UPLOAD.zip | |
T1DEXIP | JAEB | -/March 16th, 2023 | T1DEXIP - DATA FOR UPLOAD.zip | |
REPLACE BG | JAEB | -/February 2nd, 2025 | REPLACE-BG Dataset-79f6bdc8-3c51-4736-a39f-c4c0f71d45e5 | ⚠️The currently hosted version misses the Basal file. |
Loop | JAEB | 2023-01-31 | Loop study public dataset 2023-01-31.zip | Due to the extensive file sizes, we convert the csv files to parquet files in a temporary folder to allow parallel processing of the results and avoid out of memory problems. You can delete this folder afterwards. |
* We have only tested our code on the respective versions.
If you are encountering problems with running the datasets, feel free to reach out to us.
How to Contribute
BabelBetes was funded to be freely available, helping researchers and companies save costs and time, and supercharge innovation in diabetes care.
We’re incredibly excited for contributions that will expand its functionality and support even more datasets, making a bigger impact than ever before!
Learn more about how to contribute.
Key Features of the Toolbox
1. Analaysis scripts and documentation: You can learn about the datasets and what challenges came with normalizing tem by consulting the dataset summaries. You might also consult and review the jupyter notebooks that document our analysis.
2. Python modules: You can use the python modules to extract standardized continuous glucose monitor (CGM) and insulin pump data from the supported study datasets. Reuse the helper and drawing functions to work with the data. - Extend the functionality of existing study classes or add new implementations of the StudyDataset base class to support additional study datasets.
3. Recommendations: As guidance for investigators, we've summarized our learnings and challanges in a list of recommendations that we believe would dramatically improve the quality and usability of datasets published in the future.
Data Standardization
The ultimate purpose of this toolbox is to bring CGM and insulin data into a common standardized format. We chose to abstract study datasets as objects. Each study class derives from the parent StudyDataset
class and overrides methods to extract cgm, bolus and basal data. The StudyDataset base class defineds methods to extract cgm, basal and bolus data in standardized pandas dataframes.
For example, the bolus dataframe obtained with extract_bolus_event_history()
has this format:
Column Name | Type | Description |
---|---|---|
patient_id |
str |
Patient ID |
datetime |
pd.Timestamp |
Datetime of the bolus event |
bolus |
float |
Actual delivered bolus amount in units |
delivery_duration |
pd.Timedelta |
Duration of the bolus delivery |
refer to the Code Reference for more details.
How to use BabelBetes (Quickstart)
Here, we explain how to install the toolbox and how to use the run_functions.py
script that batch processes all studies and extracts the standardized data.
Setup Python
- Make sure you have python version > 3.X installed.
- We recommend using a python virtual environment (see using vitual environments)
Installation
- Clone the repository:
sh git clone git@github.com:nudgebg/babelbetes.git
- Install all dependencies
- In your terminal, navigate to the repository
- (Optional) activate your python virtual environment
- Run this command to install all packages required by BabelBetes
pip install -r requirements.txt
Prepare the raw data
- Download the study data zip files from jaeb.org (see supported studies).
- Move the files inside the
data/raw
directory. Zipped files can either be used directly or unzipped. Do not rename the file/folder names, otherwise therun_functions.py
won't know how to process them. - Depending on which studies you downloaded and whether you have .zip archives (or unzipped folders), the folder structure should look like this:
babelbetes/
├── data/
│ └── raw/
│ └── FLAIRPublicDataSet.zip
│ └── DCLP3 Public Dataset - Release 3 - 2022-08-04
│ └── IOBP2 RCT Public Dataset
│ └── T1DEXI - DATA FOR UPLOAD
│ └── T1DEXIP - DATA FOR UPLOAD.zip
└── run_functions.py
Run run_functions.py to batch Extract data
The run_functions.py
script is the entry point for users that simply want to extract standardized data from the supported studies. It performs data extraction and standarization. For each folder in the data/raw
directory the script:
1. Identifies the appropriate handler class (see supported studies)
2. Loads the study data
3. Extracts bolus, basal, and CGM event histories to a standardized Format (see data standardization)
4. Saves the extracted data in CSV format.
Example terminal output:
> python run_functions.py
[15:26:22] Looking for study folders in /data/raw and saving results to /data/out
[15:26:22] Start processing supported study folders:
[15:26:22] 'T1DEXI' using T1DEXI class
[15:26:22] 'REPLACE-BG Dataset-79f6bdc8-3c51-4736-a39f-c4c0f71d45e5' using ReplaceBG class
...
[15:26:22] Processing T1DEXI ...
[15:26:56] [x] Data loaded
[15:26:56] [x] Boluses extracted
[15:27:00] [x] Basal extracted
[15:27:12] [x] CGM extracted
[15:27:12] T1DEXI completed in 37.43 seconds.
...
Processing complete.
Execution Times
These are approximate execution times
MacBook Pro M3 | |
---|---|
Flair | 58 seconds |
IOBP2 | 26 seconds |
PEDAP | 34 seconds |
DCLP3 | 15 seconds |
DCLP5 | 23 seconds |
T1DEXI | 37 seconds |
T1DEXIP | 7 seconds |
Replace BG | 30 seconds |
Loop | 151 seconds* |
Total | ~383 seconds |
* Loop raw data files are very large which requires the use of dask
. dask
builds upon pandas and processes chunks of the data in parallel. However, the routine to save the data to csv - at the moment - still requires the whole dataframe to be loaded into memory before storing it which might fail if your machine has insufficient memory.
Troubleshooting
- Ensure the raw data folders are named correctly to match the patterns in the script. You shouldn't need to rename the folders or zip archivesafter you downloaded the datasets.
- Check the console output for any warning or error messages.