Code Reference
This part of the project documentation focuses on an information-oriented approach. Use it as a reference for the technical implementation of the BabelBetes project code.
run_functions
run_functions.py
This script performs data normalization on raw study data found in the data/raw
directory.
Execution
python run_functions.py
Process Overview:
1. Identifies the appropriate handler class (subclass of studydataset) for each folder in the data/raw
directory (see supported studies).
2. Loads the study data into memory.
3. Extracts bolus, basal, and CGM event histories into a standardized format (see Output Format).
4. Saves the extracted data as CSV files.
Output format:
The output format is standardized across all studies and follows the definitions of the studydataset base class.
Boluses
bolus_history.csv: Event stream of all bolus delivery events. Standard boluses are assumed to be delivered immediately.
Column Name | Type | Description |
---|---|---|
patient_id | str | Patient ID |
datetime | pd.Timestamp | Datetime of the bolus event |
bolus | float | Actual delivered bolus amount in units |
delivery_duration | pd.Timedelta | Duration of the bolus delivery |
Basal Rates
basal_history.csv:
Event stream of basal rates, accounting for temporary basal adjustments, pump suspends, and closed-loop modes. The basal rates are active until the next rate is reported.
Column Name | Type | Description |
---|---|---|
patient_id | str | Patient ID |
datetime | pd.Timestamp | Datetime of the basal rate start event |
basal_rate | float | Basal rate in units per hour |
CGM (Continuous Glucose Monitor)
cgm_history.csv: Event stream of CGM values.
Column Name | Type | Description |
---|---|---|
patient_id | str | Patient ID |
datetime | pd.Timestamp | Datetime of the CGM measurement |
cgm | float | CGM value in mg/dL |
Output Files:
For each study, the dataframes are saved in the data/out/<study-name>/
folder:
- To reduce file size, the data is saved in a compressed format using gzip.
- Datetimes and timedeltas are saved as unix timestamps (seconds) and integers (seconds), respectively.
- Boluses and basals are rounded to 4 decimal places.
- CGM values are converted to integers.
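As a sketch of how the compressed CSV output can be read back, the unix-second columns can be restored with pandas. The in-memory sample below is hypothetical and only mirrors the layout described above; for a real file, the analogous call would be `pd.read_csv('data/out/<study-name>/bolus_history.csv.gz', compression='gzip')`, where the `.gz` suffix is an assumption:

```python
import io

import pandas as pd

# Hypothetical in-memory sample mirroring the compressed bolus_history.csv layout:
# datetimes stored as unix timestamps (seconds), delivery_duration as integer seconds.
raw = io.StringIO(
    "patient_id,datetime,bolus,delivery_duration\n"
    "10,1524150016,1.25,0\n"
    "10,1524153616,2.5,1800\n"
)
df = pd.read_csv(raw, dtype={"patient_id": str})

# Restore the original types described in the output format.
df["datetime"] = pd.to_datetime(df["datetime"], unit="s")
df["delivery_duration"] = pd.to_timedelta(df["delivery_duration"], unit="s")
```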
main(load_subset=False, remove_repetitive=True, output_format='parquet', compressed=False, input_dir=None, output_dir=None)
Main function to process study data folders.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
load_subset | bool | If True, runs the script on a limited amount of data (e.g. skipping rows). | False |
output_format | str | The format to save the output files ('csv' or 'parquet'). | 'parquet' |
compressed | bool | Whether to compress the output files. | False |
input_dir | str | Custom input directory path. Defaults to 'data/raw'. | None |
output_dir | str | Custom output directory path. Defaults to 'data/out'. | None |
Logs
- Information about the current working directory and paths being used.
- Warnings for folders that do not match any known study patterns.
- Errors if no supported studies are found.
- Progress of processing each matched study folder.
process_folder(study, out_path_study, progress, load_subset, remove_repetitive, output_format, compressed)
Processes the data for a given study by loading, extracting, and resampling bolus, basal, and glucose events.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
study | object | An instance of a study class that contains methods to load and extract data. | required |
out_path_study | str | The output directory path where the processed data will be saved. | required |
progress | tqdm | A tqdm progress bar object to display the progress of the processing steps. | required |
output_format | str | The format to save the output files ('csv' or 'parquet'). | required |
compressed | bool | Whether to compress the output files. | required |
Steps
- Loads the study data.
- Extracts bolus event history and saves it as a file.
- Extracts basal event history and saves it as a file.
- Extracts continuous glucose monitoring (CGM) history and saves it as a file. Each step updates the progress bar and logs the current status.
studies.studydataset
StudyDataset
The StudyDataset
class is designed to represent a clinical diabetes dataset with continuous glucose monitoring and insulin delivery data in the form of boluses and basal rates.
By subclassing and implementing the required methods, it can be used to extract continuous glucose monitoring (CGM) data, bolus event history, and basal event history from a dataset.
The following private methods need to be implemented by subclasses:
- _load_data: This method should load the data from the study directory.
- _extract_bolus_event_history: This method should extract the bolus event history from the dataset.
- _extract_basal_event_history: This method should extract the basal event history from the dataset.
- _extract_cgm_history: This method should extract the CGM measurements from the dataset.
Output Validation: While subclasses should implement the private methods, the extraction methods should not be overridden. Instead, the output of these methods is validated using decorators.
To extract the data, the extract_bolus_event_history, extract_basal_event_history, and extract_cgm_history methods should be called. These methods call the private methods and validate the output.
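The subclassing pattern can be sketched as follows. This is an illustration with a simplified stand-in base class, not the actual studydataset implementation (the real base class additionally validates and caches the extracted frames); the study name and source columns are made up:

```python
import pandas as pd


class StudyDatasetSketch:
    """Stand-in for the real StudyDataset base class (illustration only)."""

    def __init__(self, study_path):
        self.study_path = study_path
        self.data_loaded = False

    def load_data(self, subset=False):
        # Load once and cache; subsequent calls are no-ops.
        if not self.data_loaded:
            self._load_data(subset)
            self.data_loaded = True

    def extract_bolus_event_history(self):
        self.load_data()
        return self._extract_bolus_event_history()


class ExampleStudy(StudyDatasetSketch):
    def _load_data(self, subset):
        # A real subclass would read CSV files from self.study_path here.
        self.df = pd.DataFrame(
            {"PtID": [1], "DeliveredValue": [1.5], "DataDtTm": ["2018-04-19 14:20:16"]}
        )

    def _extract_bolus_event_history(self):
        # Map study-specific columns onto the standardized output format.
        return pd.DataFrame(
            {
                "patient_id": self.df["PtID"].astype(str),
                "datetime": pd.to_datetime(self.df["DataDtTm"]),
                "bolus": self.df["DeliveredValue"].astype(float),
                "delivery_duration": pd.to_timedelta(0, unit="s"),
            }
        )
```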
extract_basal_event_history()
Uses _extract_basal_event_history to extract the basal event history, performs type checking, and caches the result.
Warning
Don't override this: This method performs type checking on the output data and should not be overridden by subclasses. Instead, subclasses should implement the _extract_basal_event_history method.
Notes
- Include zero basal rates: The assumption is that basal rates continue until a new rate is set. Therefore, zero basal rates should be included in the output.
- Account for suspend and temporary basal events.
- Ensure the datetime column is a pandas datetime object of type datetime64[ns]; otherwise the validation will fail.
Returns:
Name | Type | Description |
---|---|---|
basal_event_history | DataFrame | The basal event history with the columns patient_id, datetime, and basal_rate (see Output Format). |
extract_bolus_event_history()
Extract bolus event history from the dataset, perform type checking, and cache the result.
Notes:
For standard boluses the delivery duration is 0 seconds; for extended boluses it is the duration of the extended delivery.
Warning:
Don't override this: This method performs type checking on the output data and should not be overridden by subclasses. Instead, subclasses should implement the _extract_bolus_event_history method.
Returns:
Name | Type | Description |
---|---|---|
bolus_events | DataFrame | The bolus event history with the columns patient_id, datetime, bolus, and delivery_duration (see Output Format). |
extract_cgm_history()
Extract CGM measurements from the dataset, perform type checking, and cache the result.
Warning
Don't override this! This method performs type checking on the output data and should not be overridden by subclasses. Instead, subclasses should implement the _extract_cgm_history method.
Returns:
Name | Type | Description |
---|---|---|
cgm_measurements | DataFrame | A DataFrame containing the CGM measurements with the columns patient_id, datetime, and cgm (see Output Format). |
load_data(subset=False)
Load and cache the study data into memory by calling the _load_data
method which should be implemented by subclasses.
This method is automatically called when one of the extraction methods is invoked, but it can also be called up front. After the data has been loaded, the member variable data_loaded is set to True and subsequent calls to this method will not reload the data.
Notes
- Don't override this: This method should not be overridden by subclasses. Instead, subclasses should implement the _load_data method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
subset | bool | If True, only loads a small subset of the data for testing purposes. | False |
validate_basal_output_dataframe(func)
A decorator to validate the output of a function to ensure it is a pandas DataFrame with specific required columns and data types.
It is used to validate the output of the extract_basal_event_history
method in the StudyDataset
class.
Subclasses should implement the _extract_basal_event_history
method which is called by the extract_basal_event_history
method to use this decorator.
The DataFrame must have the following columns (see output format in the extract_basal_event_history
method):
- 'patient_id': of type string
- 'datetime': of type pandas datetime (datetime64[ns]).
- 'basal_rate': of numeric type
Raises:
Type | Description |
---|---|
TypeError | If the output is not a pandas DataFrame. |
ValueError | If the DataFrame does not have the required columns. |
ValueError | If the 'datetime' column is not of type pandas datetime (datetime64[ns]). |
ValueError | If the 'patient_id' column is not of type string. |
ValueError | If the 'basal_rate' column is not of numeric type. |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
func |
function
|
The function whose output will be validated. |
required |
Returns:
Name | Type | Description |
---|---|---|
function |
function
|
The wrapped function with validation applied to its output. |
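A minimal sketch of how such a validation decorator can work (illustrative only, not the project's exact implementation):

```python
import functools

import pandas as pd
from pandas.api.types import is_numeric_dtype


def validate_basal_output_sketch(func):
    """Validate that the wrapped function returns a basal-event DataFrame."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        df = func(*args, **kwargs)
        if not isinstance(df, pd.DataFrame):
            raise TypeError("Output must be a pandas DataFrame")
        required = {"patient_id", "datetime", "basal_rate"}
        missing = required - set(df.columns)
        if missing:
            raise ValueError(f"Missing columns: {missing}")
        if df["datetime"].dtype != "datetime64[ns]":
            raise ValueError("'datetime' must be of dtype datetime64[ns]")
        if not is_numeric_dtype(df["basal_rate"]):
            raise ValueError("'basal_rate' must be of numeric type")
        return df

    return wrapper
```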
validate_bolus_output_dataframe(func)
A decorator to validate the output of a function that returns a pandas DataFrame. It is used to validate the output of the extract_bolus_event_history
method in the StudyDataset
class.
Subclasses should implement the _extract_bolus_event_history
method which is called by the extract_bolus_event_history
method to use this decorator.
The DataFrame must have the following (see output format in the extract_bolus_event_history
method):
- 'patient_id': of type string
- 'datetime': of type pandas datetime
- 'bolus': of type float
- 'delivery_duration': of type pandas timedelta
Raises:
Type | Description |
---|---|
TypeError | If the output is not a pandas DataFrame. |
ValueError | If the DataFrame does not have the required columns. |
ValueError | If the 'datetime' column is not of type pandas datetime (datetime64[ns]). |
ValueError | If the 'patient_id' column is not of type string. |
ValueError | If the 'bolus' column is not of type float. |
ValueError | If the 'delivery_duration' column is not of type pandas timedelta. |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
func |
callable
|
The function to be decorated. |
required |
Returns:
Name | Type | Description |
---|---|---|
function |
function
|
The wrapped function with validation applied to its output. |
validate_cgm_output_dataframe(func)
A decorator to validate the output of a function to ensure it is a pandas DataFrame with specific required columns and data types.
It is used to validate the output of the extract_cgm_history
method in the StudyDataset
class.
Subclasses should implement the _extract_cgm_history
method which is called by the extract_cgm_history
method to use this decorator.
The DataFrame must have the following columns (see output format in the extract_cgm_history
method):
- 'patient_id': of type string
- 'datetime': of type pandas datetime (datetime64[ns]).
- 'cgm': of numeric type
Raises:
Type | Description |
---|---|
TypeError | If the output is not a pandas DataFrame. |
ValueError | If the DataFrame does not have the required columns. |
ValueError | If the 'datetime' column is not of type pandas datetime (datetime64[ns]). |
ValueError | If the 'patient_id' column is not of type string. |
ValueError | If the 'cgm' column is not of numeric type. |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
func |
function
|
The function whose output will be validated. |
required |
Returns:
Name | Type | Description |
---|---|---|
function |
function
|
The wrapped function with validation applied to its output. |
studies.iobp2.IOBP2
Bases: StudyDataset
studies.flair.Flair
Bases: StudyDataset
get_reported_tdds(method='max')
Retrieves reported total daily doses (TDDs) based on the specified method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
method | str | The method to use for retrieving the TDDs: 'max' returns the TDD with the maximum reported value for each patient and date; 'sum' returns the sum of all reported TDDs for each patient and date; 'latest' returns the TDD with the latest reported datetime for each patient and date; 'all' returns all TDDs without any grouping or filtering. | 'max' |
Returns:
Type | Description |
---|---|
DataFrame | The DataFrame containing the retrieved TDDs based on the specified method. |
Raises:
Type | Description |
---|---|
ValueError | If the method is not one of: 'max', 'sum', 'latest', 'all'. |
studies.pedap.PEDAP
Bases: StudyDataset
studies.dclp.DCLP3
Bases: StudyDataset
studies.dclp.DCLP5
Bases: DCLP3
studies.loop.Loop
Bases: StudyDataset
studies.t1dexi.T1DEXI
Bases: StudyDataset
studies.t1dexi.T1DEXIP
Bases: T1DEXI
studies.replacebg.ReplaceBG
Bases: StudyDataset
src
cdf
get_cdf(data)
Get the Cumulative Distribution Function (CDF) of a data array.
Parameters: data (array-like): The data array for which the CDF is to be calculated.
Returns: tuple: A tuple containing two elements: data_sorted (array-like), the sorted data array, and cdf (array-like), the CDF values.
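An empirical CDF of this shape can be sketched in a few lines (a minimal illustration, not the project's implementation; the function name is made up):

```python
import numpy as np


def get_cdf_sketch(data):
    """Return the sorted data and the empirical CDF values."""
    data_sorted = np.sort(np.asarray(data))
    # CDF value at the i-th sorted point is (i + 1) / n.
    cdf = np.arange(1, len(data_sorted) + 1) / len(data_sorted)
    return data_sorted, cdf
```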
plot_cdf(data, title='CDF', xlabel='Value', ylabel='CDF', ax=None, **kwargs)
Plots the Cumulative Distribution Function (CDF) of a data array.
Parameters: data (array-like): The data array for which the CDF is to be plotted. title (str): The title of the plot. xlabel (str): The label for the x-axis. ylabel (str): The label for the y-axis.
date_helper
convert_duration_to_timedelta(duration)
Parse a duration string in the format "hours:minutes:seconds" and return a timedelta object.
Parameters: duration (str): The duration string to parse, in the form "hours:minutes:seconds".
Returns: timedelta: A timedelta object representing the parsed duration.
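A minimal sketch of such a parser (illustrative only; the function name is made up and no input validation is shown):

```python
from datetime import timedelta


def convert_duration_sketch(duration):
    """Parse an 'hours:minutes:seconds' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)
```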
parse_flair_dates(dates, format_date='%m/%d/%Y', format_time='%I:%M:%S %p')
Optimized parsing of date strings with or without time components.
drawing
create_axis()
Creates a new figure and axis for plotting.
Returns:
Name | Type | Description |
---|---|---|
figure | Figure | The created figure. |
axes | Axes | The created axis. |
drawAbsoluteBasalRates(ax, datetimes, rate, **kwargs)
Draws the absolute basal rates on the given axes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ax | Axes | The axes on which to draw the basal rates. | required |
datetimes | array-like | An array of datetime objects representing the time points. | required |
rate | array-like | An array of basal rates corresponding to the time points. | required |
**kwargs | dict | Additional keyword arguments to customize the plot. | {} |
drawBasal(ax, datetimes, rates, color=colors['Basal'], **kwargs)
Draws the basal rates on the given axes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ax | Axes | The axes on which to draw the basal rates. | required |
datetimes | list of datetime | List of datetime objects representing the time points. | required |
rates | list of float | List of basal rates corresponding to the datetime points. | required |
color | str | Color for the basal rates plot. Defaults to colors['Basal']. | colors['Basal'] |
**kwargs | dict | Additional keyword arguments to customize the plot. | {} |
drawBoluses(ax, datetimes, boluses, **kwargs)
Draws insulin bolus events on a given matplotlib axis.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ax | Axes | The axis on which to draw the boluses. | required |
datetimes | list of datetime.datetime | List of datetime objects representing the times of the boluses. | required |
boluses | list of float | List of bolus values corresponding to the datetimes. | required |
**kwargs | dict | Additional keyword arguments passed to the ax.bar() method. | {} |
drawCGM(ax, datetimes, values, color=colors['CGM'], unit='mg/dL', **kwargs)
Draws CGM (Continuous Glucose Monitoring) data on the given axes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ax | Axes | The axes on which to draw the CGM data. | required |
datetimes | list of datetime | List of datetime objects representing the time points. | required |
values | list of float | List of glucose values corresponding to the datetime points. | required |
color | str | Color for the CGM plot. Defaults to colors['CGM']. | colors['CGM'] |
**kwargs | dict | Additional keyword arguments to customize the plot. | {} |
drawExtendedBoluses(ax, datetimes, boluses_units, duration, color=colors['Bolus'], **kwargs)
Draws extended boluses on the given axes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ax | Axes | The axes on which to draw the boluses. | required |
datetimes | list of datetime | List of datetime objects representing the times of the boluses. | required |
boluses_units | list of float | List of bolus units corresponding to each datetime. | required |
duration | list of numpy.timedelta | List of delivery durations for each bolus. | required |
color | str | Color of the boluses. Default is colors['Bolus']. | colors['Bolus'] |
**kwargs | dict | Additional keyword arguments to pass to the bar function. | {} |
drawSuspendTimes(ax, start_date, duration)
Draws a bar on the given axis to represent suspend times.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ax | Axes | The axis on which to draw the bar. | required |
start_date | datetime-like | The starting date and time for the bar. | required |
duration | timedelta | The duration for which the bar extends. | required |
drawTempBasal(ax, datetimes, temp_basal_rates, temp_basal_durations, temp_basal_types, color=colors['Basal'], **kwargs)
Draws temporary basal rates on the given axes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ax | Axes | The axes on which to draw the temporary basal rates. | required |
datetimes | list of datetime | List of datetime objects representing the times of the temporary basal rates. | required |
temp_basal_rates | list of float | List of temporary basal rates corresponding to the datetimes. | required |
temp_basal_durations | list of numpy.timedelta | List of temporary basal durations corresponding to the datetimes. | required |
color | str | Color of the temporary basal rates. Default is colors['Basal']. | colors['Basal'] |
**kwargs | dict | Additional keyword arguments passed to the ax.bar() method. | {} |
draw_presence_matrix(ax, df, x_col, y_col, offset=0, **kwargs)
Scatter-plots the unique available x_col values for each y_col group in a DataFrame.
Args:
ax (matplotlib.axes.Axes): The matplotlib Axes object to plot on.
df (pd.DataFrame): The input DataFrame containing the data to plot.
x_col (str): The column name in the DataFrame representing the x-axis values (e.g., datetime).
y_col (str): The column name in the DataFrame representing the y-axis values used for grouping (e.g., patient IDs).
offset (int, optional): An offset to apply to the y-axis values. Defaults to 0.
**kwargs: Additional keyword arguments to pass to the ax.scatter method.
Returns:
None
parse_duration(duration_str)
Parses a duration string in the format "HH:MM:SS" and returns a timedelta object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
duration_str | str | A string representing the duration in the format "HH:MM:SS". | required |
Returns: timedelta: A timedelta object representing the parsed duration.
file_saver
save_dataframe(df, out_path, output_format, compressed, study_name, data_type)
Save a DataFrame to a specified format (CSV or Parquet) with optional compression.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The DataFrame to save. | required |
out_path | str | The base directory for the output files. | required |
output_format | str | The output format ('csv' or 'parquet'). | required |
compressed | bool | If True, reduces resolution and compresses the output file (for CSV). | required |
study_name | str | The name of the study. | required |
data_type | str | The type of data being saved (e.g., 'cgm', 'bolus', 'basal'). | required |
save_to_csv(df, file_path, compressed)
Save a pandas DataFrame to a CSV file. The file can be compressed using gzip.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The DataFrame to save. | required |
file_path | str | The path to the output file. | required |
compressed | bool | If True, the output file will be compressed using gzip. | required |
save_to_parquet_partitioned(df, base_path, study_name, data_type)
Save a pandas DataFrame to Parquet files, partitioned by specified columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The DataFrame to save. | required |
base_path | str | The base directory for the output files. | required |
study_name | str | The name of the study. | required |
data_type | str | The type of data being saved. | required |
find_periods
find_periods(df, value_col, time_col, start_trigger_fun, stop_trigger_fun, use_last_start_occurence=False)
Find periods in a DataFrame based on start and stop triggers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The DataFrame to search for periods. | required |
value_col | str | The name of the column containing the trigger values. | required |
time_col | str | The name of the column containing the time values. | required |
start_trigger_fun | callable | Callable that returns True for values indicating the start of a period. | required |
stop_trigger_fun | callable | Callable that returns True for values indicating the end of a period. | required |
use_last_start_occurence | bool | If True, the last occurrence of the start trigger will be used. | False |
Returns:
Name | Type | Description |
---|---|---|
list | list | A list of named tuples representing the periods found. Each named tuple contains: start_index (int), the index of the start trigger in the DataFrame; end_index (int), the index of the stop trigger in the DataFrame; start_time, the time value of the start trigger; end_time, the time value of the stop trigger. |
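The trigger-scanning idea can be sketched as follows (a simplified illustration over plain lists, not the project's implementation; it omits the use_last_start_occurence option and returns plain tuples):

```python
def find_periods_sketch(values, times, start_trigger_fun, stop_trigger_fun):
    """Collect (start_index, end_index, start_time, end_time) tuples."""
    periods, start = [], None
    for i, value in enumerate(values):
        if start is None and start_trigger_fun(value):
            start = i  # open a period at the first start trigger
        elif start is not None and stop_trigger_fun(value):
            periods.append((start, i, times[start], times[i]))
            start = None  # close the period and look for the next start
    return periods
```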
logger
Logger
get_logger(name, level=logging.DEBUG)
staticmethod
Returns a configured logger instance with the specified name and log level.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the logger. | required |
level | int | The logging level (e.g., logging.INFO, logging.DEBUG). | DEBUG |
Returns:
Type | Description |
---|---|
Logger | Configured logging.Logger instance. |
pandas_helper
count_differences_in_duplicates(df, subset)
Counts the number of differences between duplicated rows for all columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input DataFrame. | required |
subset | list | Columns used to identify duplicated rows. | required |
Returns:
Name | Type | Description |
---|---|---|
series | Series | A series where the index represents column names and values represent the count of differences. |
extract_surrounding_rows(df, index, n, sort_by)
Extracts rows surrounding a given index after sorting the DataFrame by a subset of columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input DataFrame. | required |
index | int | The row index to center on. | required |
n | int | The number of rows before and after the given index to extract (using logical indexing). | required |
sort_by | list | List of column names to sort the DataFrame by. | required |
Returns: pd.DataFrame: A DataFrame containing the extracted rows.
get_df(path, usecols=None, subset=False, dtype=None)
Reads a data file from a given path, handling both standard file formats and files within ZIP archives.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The file path or a path to a file inside a ZIP archive. | required |
usecols | list | List of column names to include in the df. | None |
subset | bool | If True, read only a subset of the data for lightweight testing. | False |
dtype | dict | Data types to enforce for specific columns. | None |
Returns:
Type | Description |
---|---|
pd.DataFrame | The loaded data as a pandas DataFrame. |
get_duplicated_max_indexes(df, check_cols, max_col)
Find duplicate indexes, maximum indexes, and indexes to drop in a dataframe.
Args: df (pd.DataFrame): The dataframe to check for duplicates. check_cols (list): The columns to check for duplicates. max_col (str): The column to use for keeping the maximum value.
Returns: tuple: A tuple containing three elements: duplicated_indexes (np.array), indexes of duplicated rows; max_indexes (np.array), indexes of rows with the maximum value in max_col; drop_indexes (np.array), indexes of rows to drop.
Example
df = pd.DataFrame({
    'PtID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 1],
    'DataDtTm': [1, 2, 3, 1, 2, 2, 1, 1, 1, 2],
    'CGMValue': [1, 2, 3, 1, 2, 3, 4, 2, 3, 3]
})
dup_indexes, max_indexes, drop_indexes = get_duplicated_max_indexes(df, ['PtID', 'DataDtTm'], 'CGMValue')
print(df.drop(drop_indexes))
grouped_value_counts(df, group_cols, value_cols)
Count the number of NaN, Non-NaN, and Zero values in each group of a DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input DataFrame. | required |
group_cols | str or list | The column(s) to group by. | required |
value_cols | str or list | The column(s) to count values for. | required |
Returns:
Name | Type | Description |
---|---|---|
dataframe | DataFrame | A DataFrame containing the count of NaN, Non-NaN, and Zero values for each group. |
head_tail(df, n=2)
Returns the first n rows and the last n rows of a DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The DataFrame to get the head and tail of. | required |
n | int | The number of rows to return from the head and tail of the DataFrame. | 2 |
Returns:
Name | Type | Description |
---|---|---|
dataframe | DataFrame | A new DataFrame containing the first n and last n rows. |
overlaps(df, datetime_col, duration_col)
Check for overlapping intervals in a DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A DataFrame containing at least two columns: datetime_col (start times of the intervals) and duration_col (durations of the intervals). | required |
Returns: pd.Series: A boolean Series indicating whether each interval overlaps with the next interval
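The overlap check can be sketched like this (an illustration, not the project's implementation; it flags an interval whose end extends past the next interval's start):

```python
import pandas as pd


def overlaps_sketch(df, datetime_col, duration_col):
    """Boolean Series: True where an interval overlaps with the next one."""
    ends = df[datetime_col] + df[duration_col]
    next_starts = df[datetime_col].shift(-1)
    # The last row has no next interval; comparison with NaT yields False.
    return ends > next_starts
```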
repetitive(df, datetime_col, value_col, max_duration)
Get the indexes of repetitive values in a DataFrame based on a datetime column and a value column.
Args: df (pd.DataFrame): The DataFrame to process. datetime_col (str): The name of the datetime column. value_col (str): The name of the value column. max_duration (timedelta, optional): To prevent long gaps between values, this parameter defines the maximum duration for which consecutive values are dropped; at least one value is kept whenever the duration exceeds max_duration.
Returns:
Name | Type | Description |
---|---|---|
tuple | tuple | A tuple containing three elements: i_all_rep (np.array), indexes of all repetitive values; i_keep (np.array), indexes of the first occurrence of repetitive values; i_drop (np.array), indexes of values to drop (repetitive values after the first occurrence). |
split_groups(x, threshold)
Assigns unique group IDs based on the distance between consecutive values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | Series | Series of numerical values. | required |
threshold | | The maximum duration between two consecutive values to consider them in the same group. | required |
Returns:
Type | Description |
---|---|
Series | A Series of group IDs aligned with the input values. |
Example
df = pd.DataFrame({
    'sensor': ['a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'y': [0, 1, 2, 3, 10, 11, 12, 13, 50, 51, 70, 71]
})
df['sensor_session'] = df.groupby('sensor').y.transform(lambda x: split_groups(x, 5))
start_ends = df.groupby(['sensor', 'sensor_session']).y.agg(['idxmin', 'idxmax']).reset_index()
split_sequences(df, label_col)
Assigns a unique group ID to each sequence of consecutive labels.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The DataFrame containing the data. | required |
label_col | str | The column name for the labels. | required |
Returns:
Name | Type | Description |
---|---|---|
group_ids | Series | The group IDs. |
Example
df = pd.DataFrame({'label': ['A', 'A', 'B', 'B', 'B', 'A', 'A', 'C', 'C', 'A']})
df['sequence'] = split_sequences(df, 'label')
print(df)
start_ends = df.groupby(['label', 'sequence']).apply(lambda group: pd.Series({
    'idxmin': group.index.min(),
    'idxmax': group.index.max()
}), include_groups=False).reset_index()
print(start_ends)
postprocessing
basal_transform(basal_data)
Transform the basal data by aligning timestamps and handling duplicates.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
basal_data | DataFrame | A basal data dataframe containing the columns 'datetime' and 'basal_rate'. | required |
Returns:
Name | Type | Description |
---|---|---|
basal_data | DataFrame | The transformed basal equivalent deliveries with aligned timestamps and duplicates removed. |
bolus_transform(df)
Transform the bolus data by aligning timestamps, handling duplicates, and extending boluses based on durations.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A bolus data dataframe containing the columns 'datetime', 'bolus', and 'delivery_duration'. | required |
Returns:
Name | Type | Description |
---|---|---|
bolus_data | DataFrame | Bolus data resampled to 5 minutes and time-aligned at midnight, with columns: datetime, delivery. |
cgm_transform(cgm_data)
Time aligns the cgm data to midnight with a 5 minute sampling rate.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cgm_data | DataFrame | A cgm data dataframe containing the columns 'datetime' and 'cgm'. | required |
Returns:
Name | Type | Description |
---|---|---|
cgm_data | DataFrame | The transformed cgm data with aligned timestamps. |
compress_dataframe_storage(df)
Reduces DataFrame size by converting columns to appropriate types. This function is useful for optimizing memory usage when saving DataFrames to disk.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The DataFrame whose column types are reduced. | required |
Example CSV Output for cgm
patient_id,datetime,cgm
10,1524150016,88
10,1524150270,85
10,1524150568,81
tdd
calculate_daily_basal_dose(df)
Calculate the Total Daily Dose (TDD) of basal insulin for each day in the given DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The DataFrame containing the insulin data. | required |
Returns:
Name | Type | Description |
---|---|---|
tdds | DataFrame | A DataFrame with two columns: the date and the total daily basal dose. |
Required Column Names
- datetime: The timestamp of each basal insulin rate event.
- basal_rate: The basal insulin rate event [U/hr].
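Since each basal rate is active until the next rate is reported, the daily basal dose is the sum of rate × active-duration over the day. This can be sketched as follows (an illustration, not the project's implementation; closing the last rate at the end of its day is an assumption, and rates spanning midnight are attributed to their start day here):

```python
import pandas as pd


def daily_basal_dose_sketch(df):
    """Sum basal_rate [U/hr] times active hours, grouped by day."""
    df = df.sort_values("datetime").reset_index(drop=True)
    # Each rate is held until the next event; the last rate is assumed
    # to run until the end of its calendar day.
    end_of_data = df["datetime"].iloc[-1].normalize() + pd.Timedelta(days=1)
    next_times = df["datetime"].shift(-1).fillna(end_of_data)
    hours = (next_times - df["datetime"]).dt.total_seconds() / 3600
    doses = df["basal_rate"] * hours
    return doses.groupby(df["datetime"].dt.date).sum()
```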
calculate_daily_bolus_dose(df)
Calculate the daily bolus dose for each patient.
Parameters: df (pandas.DataFrame): The input DataFrame containing the following columns: datetime (datetime), the date and time of the bolus dose; bolus (float), the amount of the bolus dose.
Returns: pandas.DataFrame: A DataFrame with the daily bolus dose for each patient, grouped by patient_id and date.
calculate_tdd(df_bolus, df_basal)
Calculates the total daily dose (TDD) by merging the daily basal dose and daily bolus dose.
Parameters: df_bolus (DataFrame): DataFrame containing the bolus dose data with columns patient_id (int), datetime (datetime), and bolus (float). df_basal (DataFrame): DataFrame containing the basal dose data with columns patient_id (int), datetime (datetime), and basal_rate (float) [U/hr].
Returns: tdd (DataFrame): DataFrame containing both the bolus and basal TDD data.
durations_since_previous_valid_value(dates, values)
Calculate the durations between each date and the previous date with a valid (non-NaN) value.
Parameters: dates (list): A list of dates. values (list): A list of values.
Returns:
Name | Type | Description |
---|---|---|
durations | list | A list of durations between each date and the previous valid date; NaN if there is no previous valid date. |
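The idea can be sketched as a single pass that tracks the last date with a valid value (an illustration over plain lists, not the project's implementation):

```python
import math


def durations_since_previous_valid_sketch(dates, values):
    """Duration from each date back to the last date with a non-NaN value."""
    durations, last_valid = [], None
    for date, value in zip(dates, values):
        # Use only dates strictly before the current one.
        durations.append(date - last_valid if last_valid is not None else math.nan)
        if not (isinstance(value, float) and math.isnan(value)):
            last_valid = date
    return durations
```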
total_delivered(df, datetime_col, rate_col)
Calculate the total delivered insulin over the time intervals in the given DataFrame.