Code Reference
This is the technical documentation of the BabelBetes core modules.
babelbetes.run_functions
run_functions.py
This script performs data normalization on raw study data found in the data/raw directory.
Execution
python run_functions.py
Process Overview:
1. Identifies the appropriate handler class (subclass of studydataset) for each folder in the data/raw directory (see supported studies).
2. Loads the study data into memory.
3. Extracts bolus, basal, CGM event histories, and age data into a standardized format (see Output Format).
4. Saves the extracted data as CSV files.
Output format:
The output format is standardized across all studies and follows the definitions of the studydataset base class.
Boluses
bolus_history.csv: Event stream of all bolus delivery events. Standard boluses are assumed to be delivered immediately.
| Column Name | Type | Description |
|---|---|---|
| patient_id | str | Patient ID |
| datetime | pd.Timestamp | Datetime of the bolus event |
| bolus | float | Actual delivered bolus amount in units |
| delivery_duration | pd.Timedelta | Duration of the bolus delivery |
Basal Rates
basal_history.csv: Event stream of basal rates, accounting for temporary basal adjustments, pump suspends, and closed-loop modes. Each basal rate is active until the next rate is reported.
| Column Name | Type | Description |
|---|---|---|
| patient_id | str | Patient ID |
| datetime | pd.Timestamp | Datetime of the basal rate start event |
| basal_rate | float | Basal rate in units per hour |
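Because each rate stays active until the next reported rate, the length of a basal segment is not stored in the file and has to be derived from the following event. The snippet below is a minimal sketch of that derivation on made-up sample data in the column layout above; the helper columns `duration_h` and `delivered_units` are illustrative names, not part of the BabelBetes output.

```python
import pandas as pd

# Hypothetical sample in the basal_history layout described above.
basal = pd.DataFrame({
    "patient_id": ["p1", "p1", "p1", "p2", "p2"],
    "datetime": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 06:00", "2024-01-01 07:00",
        "2024-01-01 00:00", "2024-01-01 12:00",
    ]),
    "basal_rate": [1.0, 0.0, 0.5, 0.8, 1.2],
})

basal = basal.sort_values(["patient_id", "datetime"])

# A rate runs until the same patient's next event; the last event of each
# patient has no known end, so its duration is NaT/NaN.
next_start = basal.groupby("patient_id")["datetime"].shift(-1)
basal["duration_h"] = (next_start - basal["datetime"]).dt.total_seconds() / 3600

# Insulin delivered per segment in units (rate in U/h times hours).
basal["delivered_units"] = basal["basal_rate"] * basal["duration_h"]

# Sum per patient; open-ended last segments are skipped as NaN.
delivered = basal.groupby("patient_id")["delivered_units"].sum()
print(delivered)
```

The zero-rate row for `p1` represents a pump suspend: it contributes no insulin but still ends the preceding 1.0 U/h segment, which is why such rows need to stay in the event stream.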
CGM (Continuous Glucose Monitor)
cgm_history.csv: Event stream of CGM values.
| Column Name | Type | Description |
|---|---|---|
| patient_id | str | Patient ID |
| datetime | pd.Timestamp | Datetime of the CGM measurement |
| cgm | float | CGM value in mg/dL |
Age Data
age_data.csv: Patient age at study enrollment/start.
| Column Name | Type | Description |
|---|---|---|
| patient_id | str | Patient ID |
| age | float | Patient age at study enrollment/start |
Output Files:
For each study, the dataframes are saved in the data/out/<study-name>/ folder:
- To reduce file size, the data is saved in a compressed format using gzip.
- datetimes and timedeltas are saved as unix timestamps (seconds) and integers (seconds) respectively.
- boluses and basals are rounded to 4 decimal places
- cgm values are converted to integers
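The conventions above mean that a reader has to convert the integer columns back into pandas time types. The following is a minimal round-trip sketch under those conventions, using made-up data; the file name `bolus_history.csv.gz` and the temporary directory are assumptions for illustration and may not match the actual file names written by the script.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical bolus events in the standardized in-memory format.
bolus = pd.DataFrame({
    "patient_id": ["p1", "p1"],
    "datetime": pd.to_datetime(["2024-01-01 08:00:00", "2024-01-01 12:30:00"]),
    "bolus": [2.123456, 0.5],
    "delivery_duration": pd.to_timedelta([0, 1800], unit="s"),
})

# Encode: datetimes as unix seconds, timedeltas as integer seconds,
# insulin amounts rounded to 4 decimal places.
encoded = bolus.copy()
epoch = pd.Timestamp("1970-01-01")
encoded["datetime"] = (encoded["datetime"] - epoch) // pd.Timedelta(seconds=1)
encoded["delivery_duration"] = encoded["delivery_duration"] // pd.Timedelta(seconds=1)
encoded["bolus"] = encoded["bolus"].round(4)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bolus_history.csv.gz"  # assumed file name
    encoded.to_csv(path, index=False, compression="gzip")
    # Read patient_id as str so numeric-looking IDs are not parsed as integers.
    decoded = pd.read_csv(path, compression="gzip", dtype={"patient_id": str})

# Decode: restore pandas time types from the integer columns.
decoded["datetime"] = pd.to_datetime(decoded["datetime"], unit="s")
decoded["delivery_duration"] = pd.to_timedelta(decoded["delivery_duration"], unit="s")
print(decoded)
```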
main(load_subset=False, remove_repetitive=True, input_dir=None, output_dir=None, studies=None, data_types=None)
Main function to process study data folders.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| load_subset | bool | If True, runs the script on a limited amount of data (e.g. skipping rows). | False |
| input_dir | str | Custom input directory path. Defaults to 'data/raw'. | None |
| output_dir | str | Custom output directory path. Defaults to 'data/out'. | None |
| studies | list | List of study names to process. If None, all available studies will be processed. Available studies: IOBP2, Flair, PEDAP, DCLP3, DCLP5, ReplaceBG, Loop, T1DEXI, T1DEXIP | None |
| data_types | list | List of data types to extract ['cgm', 'bolus', 'basal', 'age']. If None, all types are extracted. | None |
Logs
- Information about the current working directory and paths being used.
- Warnings for folders that do not match any known study patterns.
- Errors if no supported studies are found.
- Progress of processing each matched study folder.
process_folder(study, out_path, progress, remove_repetitive, data_types)
Processes the data for a given study by loading, extracting, and saving bolus, basal, CGM, and age events.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| study | StudyDataset | Study instance to extract data from. | required |
| out_path | str | Root output directory (e.g. "data/out"). | required |
| progress | tqdm | Progress bar to update. | required |
| remove_repetitive | bool | Whether to drop repetitive basal values. | required |
| data_types | list | Data types to extract ['cgm', 'bolus', 'basal', 'age']. | required |
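The reference does not spell out what counts as a "repetitive" basal value. One plausible reading, given that a basal rate stays active until the next event, is that an event repeating the previous rate of the same patient carries no new information and can be dropped. The sketch below implements that reading only; `drop_repeated_basal` is a hypothetical helper, not a BabelBetes function, and the actual behavior of `remove_repetitive` may differ.

```python
import pandas as pd


def drop_repeated_basal(basal: pd.DataFrame) -> pd.DataFrame:
    """Keep only events whose rate differs from the patient's previous event."""
    basal = basal.sort_values(["patient_id", "datetime"])
    # Previous rate within the same patient; NaN for each patient's first event.
    previous = basal.groupby("patient_id")["basal_rate"].shift()
    changed = previous.isna() | (basal["basal_rate"] != previous)
    return basal[changed].reset_index(drop=True)


# Hypothetical sample: p1 repeats 1.0 and 0.0, p2 repeats 0.5.
sample = pd.DataFrame({
    "patient_id": ["p1"] * 5 + ["p2"] * 2,
    "datetime": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00",
        "2024-01-01 03:00", "2024-01-01 04:00",
        "2024-01-01 00:00", "2024-01-01 01:00",
    ]),
    "basal_rate": [1.0, 1.0, 0.0, 0.0, 1.0, 0.5, 0.5],
})

compact = drop_repeated_basal(sample)
print(compact)
```

Under this reading the compacted stream describes the same rate over time as the original, since every dropped event only restated the rate that was already active.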
babelbetes.studies.studydataset
StudyDataset
Abstract base class for clinical diabetes datasets with CGM, bolus, basal, and age data.
Subclasses implement four abstract methods:
- _extract_bolus_event_history: Return bolus events as a DataFrame.
- _extract_basal_event_history: Return basal rate events as a DataFrame.
- _extract_cgm_history: Return CGM measurements as a DataFrame.
- _extract_age_data: Return patient age at enrollment as a DataFrame.
Public properties (bolus, basal, cgm, age) validate output against pandera schemas
and cache results via cached_property. Do not override them; override the private
_extract_* methods instead.
For memory management when processing multiple studies, declare raw file cache attributes
in _raw_attrs and call unload_raw() after extraction is complete.
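The division of labor described above (private extraction method, validated and cached public property, explicit release of raw data) can be illustrated with a small stand-alone sketch. `MiniStudyDataset` and `ToyStudy` are hypothetical classes written for this example: the column check stands in for the pandera schema validation, the raw column names are invented, and the way `unload_raw` clears the cache here is an assumption rather than the BabelBetes implementation.

```python
from abc import ABC, abstractmethod
from functools import cached_property

import pandas as pd


class MiniStudyDataset(ABC):
    """Simplified stand-in for the pattern; not the BabelBetes class."""

    _raw_attrs: tuple = ()  # names of cached raw-file attributes

    @abstractmethod
    def _extract_cgm_history(self) -> pd.DataFrame:
        """Subclasses return CGM measurements here."""

    @cached_property
    def cgm(self) -> pd.DataFrame:
        df = self._extract_cgm_history()
        # Hand-written check standing in for schema validation.
        missing = {"patient_id", "datetime", "cgm"} - set(df.columns)
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return df

    def unload_raw(self) -> None:
        # cached_property stores values in the instance __dict__,
        # so removing the key frees the cached raw data.
        for name in self._raw_attrs:
            self.__dict__.pop(name, None)


class ToyStudy(MiniStudyDataset):
    _raw_attrs = ("raw_cgm",)

    def __init__(self) -> None:
        self.extract_calls = 0

    @cached_property
    def raw_cgm(self) -> pd.DataFrame:
        # Stands in for reading a large raw study file.
        return pd.DataFrame({
            "PtID": [1, 1],
            "Time": ["2024-01-01 00:00", "2024-01-01 00:05"],
            "Glucose": [110, 115],
        })

    def _extract_cgm_history(self) -> pd.DataFrame:
        self.extract_calls += 1
        raw = self.raw_cgm
        return pd.DataFrame({
            "patient_id": raw["PtID"].astype(str),
            "datetime": pd.to_datetime(raw["Time"]),
            "cgm": raw["Glucose"].astype(float),
        })


study = ToyStudy()
first = study.cgm    # runs extraction and validation
second = study.cgm   # served from the cache
study.unload_raw()   # drops raw_cgm, keeps the derived cgm frame
```

Keeping validation in the base-class property means every subclass gets the same checks for free, which is the reason given above for overriding only the private `_extract_*` methods.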
age
property
Patient age at enrollment as a validated, cached DataFrame.
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | Columns: patient_id (str), age (int, 0–120). |
basal
property
Basal rate event history as a validated, cached DataFrame.
Notes
- Zero basal rates (pump suspends) must be included.
- Rates are active until the next event.
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | Columns: patient_id (str), datetime (datetime64), basal_rate (float, units/hour). |
bolus
property
Bolus event history as a validated, cached DataFrame.
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | Columns: patient_id (str), datetime (datetime64), bolus (float, units), delivery_duration (timedelta). Standard boluses have a delivery_duration of 0 seconds. |
carbs
property
Carbohydrate meal entries as a validated DataFrame.
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | Columns: patient_id (str), datetime (datetime64), carbs (float, grams, in the range (0, 400]). Only entries with carbs > 0 are included. |
cgm
property
CGM measurements as a validated, cached DataFrame.
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | Columns: patient_id (str), datetime (datetime64), cgm (float, mg/dL). |
unload_raw()
Free raw file caches from memory.
Call this after all needed data types have been extracted to release the memory
used by raw file DataFrames. Derived outputs (bolus, basal, cgm, age) are kept.
Raw attributes to clear are declared by subclasses in _raw_attrs.
babelbetes.studies.iobp2.IOBP2
Bases: StudyDataset
babelbetes.studies.flair.Flair
Bases: StudyDataset
get_reported_tdds(method='max')
Retrieves reported total daily doses (TDDs) based on the specified method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| method | str | The method to use for retrieving the TDDs. 'max': returns the TDD with the maximum reported value for each patient and date. 'sum': returns the sum of all reported TDDs for each patient and date. 'latest': returns the TDD with the latest reported datetime for each patient and date. 'all': returns all TDDs without any grouping or filtering. | 'max' |
Returns:
| Type | Description |
|---|---|
| DataFrame | The DataFrame containing the retrieved TDDs based on the specified method. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the method is not one of: 'max', 'sum', 'latest', 'all'. |
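To make the difference between the four methods concrete, the sketch below applies them to a small made-up table in which one patient has two TDD reports on the same day. `select_tdds` and the column names `patient_id`, `datetime`, and `tdd` are assumptions chosen for this illustration; they are not taken from the Flair implementation, whose internal data layout is not documented here.

```python
import pandas as pd

VALID_METHODS = ("max", "sum", "latest", "all")


def select_tdds(tdds: pd.DataFrame, method: str = "max") -> pd.DataFrame:
    """Hypothetical helper mirroring the four documented selection methods."""
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}, got {method!r}")
    if method == "all":
        return tdds  # no grouping or filtering

    # Group by patient and calendar date.
    tdds = tdds.assign(date=tdds["datetime"].dt.normalize())
    groups = tdds.groupby(["patient_id", "date"])
    if method == "sum":
        return groups["tdd"].sum().reset_index()
    # 'max' picks the row with the largest value, 'latest' the most recent row.
    column = "tdd" if method == "max" else "datetime"
    return tdds.loc[groups[column].idxmax()].reset_index(drop=True)


# Made-up reports: two on 2024-01-01, one on 2024-01-02.
reports = pd.DataFrame({
    "patient_id": ["p1", "p1", "p1"],
    "datetime": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 23:00", "2024-01-02 23:00",
    ]),
    "tdd": [30.0, 20.0, 25.0],
})

by_max = select_tdds(reports, "max")
by_sum = select_tdds(reports, "sum")
by_latest = select_tdds(reports, "latest")
everything = select_tdds(reports, "all")
```

For the first day, 'max' keeps the 30 U report, 'latest' keeps the 20 U report from 23:00, and 'sum' adds both to 50 U, so the choice of method matters whenever a pump reports more than one total per day.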
babelbetes.studies.pedap.PEDAP
Bases: StudyDataset
babelbetes.studies.dclp.DCLP3
Bases: StudyDataset
babelbetes.studies.dclp.DCLP5
Bases: DCLP3
babelbetes.studies.loop.Loop
Bases: StudyDataset
babelbetes.studies.t1dexi.T1DEXI
Bases: StudyDataset
babelbetes.studies.t1dexi.T1DEXIP
Bases: T1DEXI
babelbetes.studies.replacebg.ReplaceBG
Bases: StudyDataset
babelbetes.data_store
cleanup(study_name, base_path, data_types=None)
Remove existing output for a study to ensure a clean write.
Deletes
base_path/
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| study_name | str | Study whose output should be removed. | required |
| base_path | str | Root output directory (e.g. "data/out"). | required |
| data_types | list[str] \| None | Data types to remove. Defaults to all four types. | None |
Returns:
| Type | Description |
|---|---|
| list[str] | List of directory paths that were actually removed. |
load(base_path, data_types=None, studies=None, patients=None)
Load data from the partitioned Parquet store.
Reads from
base_path/
The returned DataFrames include a 'study_name' column populated from the Hive partition. Each data type has its own schema:
- cgm: patient_id (str), study_name (str), datetime (datetime64), cgm (float, mg/dL)
- bolus: patient_id (str), study_name (str), datetime (datetime64), bolus (float, U), delivery_duration (timedelta64)
- basal: patient_id (str), study_name (str), datetime (datetime64), basal_rate (float, U/hr)
- age: patient_id (str), study_name (str), age (int)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| base_path | str | Root output directory (e.g. "data/out"). | required |
| data_types | list[str] \| None | Data types to load. Defaults to all four ('cgm', 'bolus', 'basal', 'age'). | None |
| studies | list[str] \| str \| None | Filter by study name(s). None loads all studies. | None |
| patients | list[str] \| str \| None | Filter by patient ID(s). None loads all patients. | None |
Returns:
| Type | Description |
|---|---|
| dict[str, DataFrame] | Dict mapping data_type → DataFrame. Always returns a dict, even for a single data type. |
save(df, study_name, data_type, base_path)
Write a DataFrame to the partitioned Parquet store.
Data is written to
base_path/
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame to save. Must contain a 'patient_id' column. | required |
| study_name | str | Study identifier (e.g. "Flair"). Written as a partition column. | required |
| data_type | str | One of 'cgm', 'bolus', 'basal', 'age'. | required |
| base_path | str | Root output directory (e.g. "data/out"). | required |