Code Reference

This is the technical documentation of the BabelBetes core modules.

babelbetes.run_functions

run_functions.py

This script performs data normalization on raw study data found in the data/raw directory.

Execution

python run_functions.py

Process Overview:
1. Identifies the appropriate handler class (a subclass of StudyDataset) for each folder in the data/raw directory (see supported studies).
2. Loads the study data into memory.
3. Extracts bolus, basal, and CGM event histories and age data into a standardized format (see Output Format).
4. Saves the extracted data as CSV files.
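Step 1 of the overview can be pictured as a name lookup over the folders in data/raw. The following is a minimal sketch only: STUDY_PATTERNS, match_study, and match_folders are hypothetical names, and the regular expressions and folder names are made up for illustration. The real matching rules are defined in run_functions.py.

```python
import re

# Hypothetical name patterns (assumption); the actual rules are not documented here.
STUDY_PATTERNS = {
    "Flair": re.compile(r"flair", re.IGNORECASE),
    "IOBP2": re.compile(r"iobp2", re.IGNORECASE),
    "DCLP3": re.compile(r"dclp3", re.IGNORECASE),
}


def match_study(folder_name):
    """Return the first study whose pattern matches the folder name, else None."""
    for study, pattern in STUDY_PATTERNS.items():
        if pattern.search(folder_name):
            return study
    return None


def match_folders(folder_names):
    """Split folder names into a folder -> study mapping and a list of unmatched folders."""
    matched, unmatched = {}, []
    for name in folder_names:
        study = match_study(name)
        if study is None:
            unmatched.append(name)  # a real run would log a warning for these
        else:
            matched[name] = study
    return matched, unmatched


matched, unmatched = match_folders(
    ["FLAIRPublicDataSet", "IOBP2 RCT Public Dataset", "notes"]
)
```

Unmatched folders are collected rather than raising, which mirrors the warning described for folders that do not match any known study pattern.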

Output format:

The output format is standardized across all studies and follows the definitions of the StudyDataset base class.

Boluses

bolus_history.csv: Event stream of all bolus delivery events. Standard boluses are assumed to be delivered immediately.

Column Name Type Description
patient_id str Patient ID
datetime pd.Timestamp Datetime of the bolus event
bolus float Actual delivered bolus amount in units
delivery_duration pd.Timedelta Duration of the bolus delivery
Basal Rates

basal_history.csv: Event stream of basal rates, accounting for temporary basal adjustments, pump suspends, and closed-loop modes. The basal rates are active until the next rate is reported.

Column Name Type Description
patient_id str Patient ID
datetime pd.Timestamp Datetime of the basal rate start event
basal_rate float Basal rate in units per hour
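Because each basal rate stays active until the next event, the insulin delivered over a window is the sum of rate times duration for each segment. The sketch below works through that arithmetic with plain Python objects. basal_units_delivered is a hypothetical helper, and the end time is an assumption needed to close the last segment.

```python
from datetime import datetime


def basal_units_delivered(events, end):
    """Integrate a basal event stream for one patient.

    events: list of (datetime, rate in U/h), sorted by time.
    end: datetime at which the last rate stops counting (assumption).
    """
    total = 0.0
    for i, (start, rate) in enumerate(events):
        # Each rate holds until the next event, or until `end` for the last one.
        stop = events[i + 1][0] if i + 1 < len(events) else end
        hours = (stop - start).total_seconds() / 3600
        total += rate * hours
    return total


events = [
    (datetime(2024, 1, 1, 0, 0), 1.0),  # 1.0 U/h for 2 hours -> 2.0 U
    (datetime(2024, 1, 1, 2, 0), 0.0),  # pump suspend for 1 hour -> 0.0 U
    (datetime(2024, 1, 1, 3, 0), 0.5),  # 0.5 U/h for 3 hours -> 1.5 U
]
total_units = basal_units_delivered(events, end=datetime(2024, 1, 1, 6, 0))
```

The zero-rate event matters here: without it, the 1.0 U/h rate would be counted for three hours instead of two.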
CGM (Continuous Glucose Monitor)

cgm_history.csv: Event stream of CGM values.

Column Name Type Description
patient_id str Patient ID
datetime pd.Timestamp Datetime of the CGM measurement
cgm float CGM value in mg/dL
Age Data

age_data.csv: Patient age at study enrollment/start.

Column Name Type Description
patient_id str Patient ID
age float Patient age at study enrollment/start
Output Files:

For each study, the dataframes are saved in the data/out/<study-name>/ folder:
  • To reduce file size, the data is saved in a compressed format using gzip.
  • Datetimes are saved as Unix timestamps (seconds) and timedeltas as integers (seconds).
  • Boluses and basals are rounded to 4 decimal places.
  • CGM values are converted to integers.
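As an illustration of these conversions, the sketch below serializes a single bolus row and writes it as gzip-compressed CSV using only the standard library. serialize_bolus_row and write_gzip_csv are hypothetical helpers, and the timestamp is assumed to be in UTC. The actual script works on DataFrames and may differ in detail.

```python
import csv
import gzip
import io
from datetime import datetime, timedelta, timezone


def serialize_bolus_row(row):
    """Convert one bolus record to the compact on-disk representation."""
    return {
        "patient_id": row["patient_id"],
        "datetime": int(row["datetime"].timestamp()),  # Unix seconds
        "bolus": round(row["bolus"], 4),  # 4 decimal places
        "delivery_duration": int(row["delivery_duration"].total_seconds()),
    }


def write_gzip_csv(rows, fieldnames):
    """Write rows as gzip-compressed CSV and return the compressed bytes."""
    buffer = io.BytesIO()
    with gzip.open(buffer, "wt", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return buffer.getvalue()


row = serialize_bolus_row(
    {
        "patient_id": "p1",
        "datetime": datetime(2024, 1, 1, tzinfo=timezone.utc),  # UTC is an assumption
        "bolus": 1.23456789,
        "delivery_duration": timedelta(seconds=90),
    }
)
payload = write_gzip_csv([row], fieldnames=list(row))
```

Reading the file back requires the inverse step: the integer columns must be converted to timestamps and timedeltas again before use.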

main(load_subset=False, remove_repetitive=True, input_dir=None, output_dir=None, studies=None, data_types=None)

Main function to process study data folders.

Parameters:

Name Type Description Default
load_subset bool

If True, runs the script on a limited amount of data (e.g. skipping rows).

False
input_dir str

Custom input directory path. Defaults to 'data/raw'.

None
output_dir str

Custom output directory path. Defaults to 'data/out'.

None
studies list

List of study names to process. If None, all available studies will be processed. Available studies: IOBP2, Flair, PEDAP, DCLP3, DCLP5, ReplaceBG, Loop, T1DEXI, T1DEXIP

None
data_types list

List of data types to extract ['cgm', 'bolus', 'basal', 'age']. If None, all types are extracted.

None
Logs
  • Information about the current working directory and paths being used.
  • Warnings for folders that do not match any known study patterns.
  • Errors if no supported studies are found.
  • Progress of processing each matched study folder.
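The None defaults listed for main could resolve roughly as sketched here. resolve_options is a hypothetical helper written for illustration and is not part of run_functions.py. The default paths, study names, and data types are taken from the parameter descriptions above.

```python
AVAILABLE_STUDIES = [
    "IOBP2", "Flair", "PEDAP", "DCLP3", "DCLP5",
    "ReplaceBG", "Loop", "T1DEXI", "T1DEXIP",
]
ALL_DATA_TYPES = ["cgm", "bolus", "basal", "age"]


def resolve_options(input_dir=None, output_dir=None, studies=None, data_types=None):
    """Replace None arguments with the documented defaults (hypothetical helper)."""
    return {
        "input_dir": "data/raw" if input_dir is None else input_dir,
        "output_dir": "data/out" if output_dir is None else output_dir,
        "studies": list(AVAILABLE_STUDIES) if studies is None else list(studies),
        "data_types": list(ALL_DATA_TYPES) if data_types is None else list(data_types),
    }


defaults = resolve_options()
subset = resolve_options(studies=["Flair"], data_types=["cgm", "basal"])
```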

process_folder(study, out_path, progress, remove_repetitive, data_types)

Processes the data for a given study by loading, extracting, and saving bolus, basal, CGM, and age events.

Parameters:

Name Type Description Default
study StudyDataset

Study instance to extract data from.

required
out_path str

Root output directory (e.g. "data/out").

required
progress tqdm

Progress bar to update.

required
remove_repetitive bool

Whether to drop repetitive basal values.

required
data_types list

Data types to extract ['cgm', 'bolus', 'basal', 'age'].

required

babelbetes.studies.studydataset

StudyDataset

Abstract base class for clinical diabetes datasets with CGM, bolus, basal, and age data.

Subclasses implement four abstract methods:
  • _extract_bolus_event_history: Return bolus events as a DataFrame.
  • _extract_basal_event_history: Return basal rate events as a DataFrame.
  • _extract_cgm_history: Return CGM measurements as a DataFrame.
  • _extract_age_data: Return patient age at enrollment as a DataFrame.

Public properties (bolus, basal, cgm, age) validate output against pandera schemas and cache results via cached_property. Do not override them; override the private _extract_* methods instead.

For memory management when processing multiple studies, declare raw file cache attributes in _raw_attrs and call unload_raw() after extraction is complete.
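The split between private _extract_* methods and cached public properties can be sketched with the standard library alone. MiniStudyDataset and DemoStudy are simplified stand-ins invented for this example: they cover one data type, use plain lists instead of DataFrames, and replace the pandera schema check with a simple range test.

```python
from abc import ABC, abstractmethod
from functools import cached_property


class MiniStudyDataset(ABC):
    """Simplified stand-in for StudyDataset (assumption, not the real class)."""

    @abstractmethod
    def _extract_age_data(self):
        """Return a list of {'patient_id': str, 'age': number} records."""

    @cached_property
    def age(self):
        records = self._extract_age_data()
        for record in records:
            # Stand-in for schema validation: age must lie within 0-120.
            if not 0 <= record["age"] <= 120:
                raise ValueError(f"age out of range: {record}")
        return records


class DemoStudy(MiniStudyDataset):
    def __init__(self):
        self.extract_calls = 0

    def _extract_age_data(self):
        self.extract_calls += 1  # counts how often extraction actually runs
        return [{"patient_id": "p1", "age": 14}]


study = DemoStudy()
first = study.age
second = study.age  # served from the cache, no second extraction
```

Overriding only the private method keeps validation and caching in one place, which is why the public properties should not be overridden.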

age property

Patient age at enrollment as a validated, cached DataFrame.

Returns:

Type Description

pd.DataFrame: Columns: patient_id (str), age (int, 0–120).

basal property

Basal rate event history as a validated, cached DataFrame.

Notes
  • Zero basal rates (pump suspends) must be included.
  • Rates are active until the next event.

Returns:

Type Description

pd.DataFrame: Columns: patient_id (str), datetime (datetime64), basal_rate (float, units/hour).

bolus property

Bolus event history as a validated, cached DataFrame.

Returns:

Type Description

pd.DataFrame: Columns: patient_id (str), datetime (datetime64), bolus (float, units), delivery_duration (timedelta). Standard boluses have delivery_duration of 0 seconds.

carbs property

Carbohydrate meal entries as a validated DataFrame.

Returns:

Type Description

pd.DataFrame: Columns: patient_id (str), datetime (datetime64), carbs (float, grams, in the range (0, 400]). Only entries with carbs > 0 are included.

cgm property

CGM measurements as a validated, cached DataFrame.

Returns:

Type Description

pd.DataFrame: Columns: patient_id (str), datetime (datetime64), cgm (float, mg/dL).

unload_raw()

Free raw file caches from memory.

Call this after all needed data types have been extracted to release the memory used by raw file DataFrames. Derived outputs (bolus, basal, cgm, age) are kept. Raw attributes to clear are declared by subclasses in _raw_attrs.
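A minimal sketch of the _raw_attrs pattern follows. RawCacheDemo and its attribute names are hypothetical, and resetting the attributes to None is an assumption about how the caches are cleared. The real unload_raw may delete them or handle them differently.

```python
class RawCacheDemo:
    # Names of attributes that hold raw file contents (declared per subclass).
    _raw_attrs = ("_raw_cgm", "_raw_pump")

    def __init__(self):
        self._raw_cgm = list(range(1000))  # stand-in for a large raw DataFrame
        self._raw_pump = list(range(1000))
        self.cgm = [120, 125]  # stand-in for a derived, cached output

    def unload_raw(self):
        """Drop raw caches so they can be garbage collected; keep derived outputs."""
        for name in self._raw_attrs:
            if hasattr(self, name):
                setattr(self, name, None)


demo = RawCacheDemo()
demo.unload_raw()
```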

babelbetes.studies.iobp2.IOBP2

Bases: StudyDataset

babelbetes.studies.flair.Flair

Bases: StudyDataset

get_reported_tdds(method='max')

Retrieves reported total daily doses (TDDs) based on the specified method.

Parameters:

Name Type Description Default
method str

The method to use for retrieving the TDDs.
  • 'max': Returns the TDD with the maximum reported value for each patient and date.
  • 'sum': Returns the sum of all reported TDDs for each patient and date.
  • 'latest': Returns the TDD with the latest reported datetime for each patient and date.
  • 'all': Returns all TDDs without any grouping or filtering.

'max'

Returns:

Type Description
DataFrame

The DataFrame containing the retrieved TDDs based on the specified method.

Raises:

Type Description
ValueError

If the method is not one of: 'max', 'sum', 'latest', 'all'.
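The four methods differ only in how reports for the same patient and date are combined. The sketch below shows those semantics on plain dictionaries. select_tdds and the record field names are hypothetical, and the real method operates on the study's DataFrames and returns a DataFrame.

```python
from collections import defaultdict
from datetime import datetime


def select_tdds(records, method="max"):
    """Combine TDD reports per (patient_id, date) according to `method`."""
    if method not in ("max", "sum", "latest", "all"):
        raise ValueError(f"unknown method: {method!r}")
    if method == "all":
        return list(records)  # no grouping or filtering

    groups = defaultdict(list)
    for record in records:
        groups[(record["patient_id"], record["datetime"].date())].append(record)

    result = {}
    for key, rows in groups.items():
        if method == "max":
            result[key] = max(row["tdd"] for row in rows)
        elif method == "sum":
            result[key] = sum(row["tdd"] for row in rows)
        else:  # 'latest': value of the report with the latest datetime
            result[key] = max(rows, key=lambda row: row["datetime"])["tdd"]
    return result


reports = [
    {"patient_id": "p1", "datetime": datetime(2024, 1, 1, 8), "tdd": 30.0},
    {"patient_id": "p1", "datetime": datetime(2024, 1, 1, 20), "tdd": 25.0},
    {"patient_id": "p1", "datetime": datetime(2024, 1, 2, 8), "tdd": 28.0},
]
by_max = select_tdds(reports, "max")
by_sum = select_tdds(reports, "sum")
by_latest = select_tdds(reports, "latest")
```

With two reports on the first day, 'max' keeps 30.0, 'sum' gives 55.0, and 'latest' keeps the 25.0 reported at 20:00.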

babelbetes.studies.pedap.PEDAP

Bases: StudyDataset

babelbetes.studies.dclp.DCLP3

Bases: StudyDataset

babelbetes.studies.dclp.DCLP5

Bases: DCLP3

babelbetes.studies.loop.Loop

Bases: StudyDataset

babelbetes.studies.t1dexi.T1DEXI

Bases: StudyDataset

babelbetes.studies.t1dexi.T1DEXIP

Bases: T1DEXI

babelbetes.studies.replacebg.ReplaceBG

Bases: StudyDataset

babelbetes.data_store

cleanup(study_name, base_path, data_types=None)

Remove existing output for a study to ensure a clean write.

Deletes

base_path//study_name=/

Parameters:

Name Type Description Default
study_name str

Study whose output should be removed.

required
base_path str

Root output directory (e.g. "data/out").

required
data_types list[str] | None

Data types to remove. Defaults to all four types.

None

Returns:

Type Description
list[str]

List of directory paths that were actually removed.

load(base_path, data_types=None, studies=None, patients=None)

Load data from the partitioned Parquet store.

Reads from

base_path//study_name=/patient_id=/*.parquet

The returned DataFrames include a 'study_name' column populated from the Hive partition. Each data type has its own schema:
  • cgm: patient_id (str), study_name (str), datetime (datetime64), cgm (float, mg/dL)
  • bolus: patient_id (str), study_name (str), datetime (datetime64), bolus (float, U), delivery_duration (timedelta64)
  • basal: patient_id (str), study_name (str), datetime (datetime64), basal_rate (float, U/hr)
  • age: patient_id (str), study_name (str), age (int)

Parameters:

Name Type Description Default
base_path str

Root output directory (e.g. "data/out").

required
data_types list[str] | None

Data types to load. Defaults to all four ('cgm', 'bolus', 'basal', 'age').

None
studies list[str] | str | None

Filter by study name(s). None loads all studies.

None
patients list[str] | str | None

Filter by patient ID(s). None loads all patients.

None

Returns:

Type Description
dict[str, DataFrame]

Dict mapping data_type → DataFrame. Always returns a dict, even for a single data type.
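Hive-style partitioning encodes column values as key=value directory names, which is how study_name can be recovered from the path and how the studies and patients filters can be applied. The sketch below illustrates the idea with the standard library. parse_partitions and matches are hypothetical helpers, and the example path is made up and keeps only the study_name= and patient_id= segments. The real load function may rely on a Parquet library to do this.

```python
from pathlib import PurePosixPath


def parse_partitions(path):
    """Collect key=value directory segments from a Hive-style path."""
    partitions = {}
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep:  # only segments that actually contain '='
            partitions[key] = value
    return partitions


def matches(partitions, studies=None, patients=None):
    """Apply filters that may each be None, a single string, or a list of strings."""
    def allowed(value, wanted):
        if wanted is None:
            return True  # None means no filtering
        if isinstance(wanted, str):
            wanted = [wanted]
        return value in wanted

    return allowed(partitions.get("study_name"), studies) and allowed(
        partitions.get("patient_id"), patients
    )


# Hypothetical file path for illustration only.
parts = parse_partitions("data/out/study_name=Flair/patient_id=17/part-0.parquet")
```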

save(df, study_name, data_type, base_path)

Write a DataFrame to the partitioned Parquet store.

Data is written to

base_path//study_name=/patient_id=/*.parquet

Parameters:

Name Type Description Default
df DataFrame

DataFrame to save. Must contain a 'patient_id' column.

required
study_name str

Study identifier (e.g. "Flair"). Written as a partition column.

required
data_type str

One of 'cgm', 'bolus', 'basal', 'age'.

required
base_path str

Root output directory (e.g. "data/out").

required