Replace BG

This page summarizes our insights about the clinical study data of the Replace BG study in efforts to understand how to handle bolus, basal, and cgm data, lists assumptions that were made, and poses open questions.

The full analysis of this dataset is provided in: notebooks/understand-replacebg-dataset/understand-replace-bg.ipynb

Study Overview

Study Name: A Randomized Trial Comparing Continuous Glucose Monitoring With and Without Routine Blood Glucose Monitoring in Adults with Type 1 Diabetes
Background: The primary objective of the study was to determine whether the routine use of CGM without BGM confirmation is as safe and effective as CGM used as an adjunct to BGM.
Duration: Run-in phase of 2–10 weeks, 26 weeks study duration
Devices: Dexcom G4 Platinum
Population: There are 208 patients with bolus, basal and cgm data.
Data: There are roughly 52683 patient days of data with an average of 253 days per patient (using cgm).

Data

“Data were uploaded from the study CGM and BGM devices and the participant’spersonal insulin pump by using the Tidepool platform (http://tidepool.org). For insulin pumps that were unable to be uploaded to the Tidepool platform, the data were obtained by using Diasend (Chicago, IL) software” (Aleppo et al., 2017, p. 540) (pdf)

The study data folder is named Loop study public dataset 2023-01-31

From the ReadMe.rtf file, the following relevant files were identified which are stored in the Data Tables subfolder.

File Name	Description	Note
HDeviceCGM.txt	One record per CGM reading	Similar to what we've seen in Loop (Tidepool structure)
HDeviceBasal.txt		Not included in the JAEB file anymore. Similar to what we've seen in Loop (Tidepool structure)
HDeviceBolus.txt	One record per bolus reading from a pump	Similar to what we've seen in Loop (Tidepool structure)
HPtRoster.txt	One record per Protocol H PtID obtained
HDeviceUploads.txt	One record per device upload	Contains information about data source (Tidepool vs. Diasend) which we need to differentiate between duration units in milli seconds (Tidepool) or minutes (Diasend)

Relevant Columns:

The following lists all relevant columns. Other columns were considered irrelevant. Some are still mentioned if they serve the discussion but are crossed through.

HDeviceCGM.txt

Field_Name	Description (Glossary)	Notes
DeviceDtTmDaysFromEnroll	Device date number of days from enrollment	convert to timedelta (unit='days'), add to DeviceTm and arbitrary study start date
DeviceTm	Device time	convert to timedelta (H::M:S)
DexInternalDtTmDaysFromEnroll	Internal date number of days from enrollment	Seems to be Dexcom time, Not missing as in Loop, let's check if needed.
DexInternalTm	Internal time	Seems to be Dexcom time, Not missing as in Loop, let's check if needed.
RecordType	Type of data (CGM, Calibration, etc)	Needed to drop calibrations
GlucoseValue	Glucose value (units: mg/dL)

HDeviceBolus.txt

Field_Name	Description (Glossary)	Notes
DeviceDtTmDaysFromEnroll	Device date number of days from enrollment	convert to timedelta (unit='days'), add to DeviceTm and arbitrary study start date
DeviceTm	Device time	convert to timedelta (H::M:S)
BolusType	Subtype of data (ex: "Normal" and "Square" are subtypes of "Bolus" type)
Normal	Number of units of normal bolus	Likely the only relevant value.
Extended	Number of units for extended delivery	We found that there are 0.4% extended boluses.
Duration	Time span over which the bolus was delivered (milliseconds for Tidepool data, minutes for Diasend data)	Our analysis shows that Duration refers to the Extended part of a Bolus. However, unclear how to find out whether in ms or minutes. JAEB couldn't answer.
ParentHDeviceUploadsID	RecID from tblHDeviceUploads	We need this to join the datasource from the HDeviceUploads table

HDeviceUploads.txt

Field_Name	Description (Glossary)	Notes
PtId		Different spelling (small d) instead of PtID
RecID	Unique record ID in table	This is the upload id which we match with the ParentHDeviceUploadsID in the other tables
DataSource	Source of data (Ex: Tidepool, Diasend, etc.)	We need this to distingusih between Tidepool and Diasend (see diasend vs. tidepool)

HDeviceBasal.txt

Note: File not included in the JAEB file anymore!

Field_Name	Description (Glossary)	Notes
PtID	Participant ID
DeviceDtTmDaysFromEnroll	Device date number of days from enrollment	convert to timedelta (unit='days'), add to DeviceTm and arbitrary study start date
DeviceTm	Device time	convert to timedelta (H::M:S)
BasalType	Basal delivery type	used to find suspends
Duration	Actual number of milliseconds basal will be in effect	Used as duration
Rate	Number of units per hour	We use this as actual delivery rate

Discarded columns

Note: As in the Loop study dataset, we assume that the Bolus columns: ExpectedNormal, ExpectedDuration, ExpectedExtended (Bolus) and the Basal columns: ExpectedDuration, Percent, SuprBasalType, SuprDuration, SuprRate can be ignored as they don't represent the actual deliveries but amounts that were suppressed. However, we will use them to investigate duplicates.

Summary

The tables mostly follow the Tidepool structure which we know from previous tudies (e.g. Loop) Differences:

Timestamps are given relative (day and HH:MM:SS) to enrollment start (which is not provided)
Insulin and InsValue columns exist, however always empty
CGM values in mgdl not mmol/dL
Dexcom times seem to be present in all rows
No timezone offsets are present, probably all in local time

TODO:

Check if Dexcom times exist and if they are needed
Check if duration is in milliseconds (Bolus)

Incomplete Patients

There are 226 unique patients in the patient roster There are 224 unique patients in the bolus table There are 208 unique patients in the basal table There are 226 unique patients in the cgm table

There are 208 patients with data in all datasets. We exclude all other patients.

Datetimes

Datetimes are provided relative to enrollment start by day (DeviceDtTmDaysFromEnroll) and time (DeviceTm). To keep data anonymous, the enrollment date is not provided. Therefore, we chose an arbitrary enrollment date for all patients enrollment_start = datetime(2015,1,1)

From the glossary we know that the run in phase is at max 10 weeks before enrollment while study duration is maximum of 26 weeks. However, we see significant amounts of data exists beyond especially for pump data.

A quick check on cgm data shows that the reconstructed datetimes result in a continuous trace.

Local time

By glossary all datetimes are in local time and the moving average analysis shows charcteristic daily patterns showing post prandial peaks (bolus,cgm). However, basal profiles do not show this trend, likely because we are dealing with a CSII and not a AID system here.

Diasend vs. Tidepool

From the glossary we know that diasend durations are given in minutes instead of ms as in Tidepool. The cdf below confirms this.

Our analysis shows that Diasend durations are in fact in minutes and need to be converted to milliseconds.

there are no Diasend Basals (all Tidepool)
There are only 1060 Diasend boluses with BolusType Combination or Normal
all Combination boluses have a duration, this is given (in minutes)
But only 4 out of 1060 have an extended part

Given that some of them have an extended part it seems more logical that the extended part is missing for the others. Therefore we treat Diasend imports by

1. Adjust from minutes to milliseconds and
2. Set Duration to 0 when extended part is missing.

Data distributions

The data distributions look as expected. Basal rates range from 0-~5.175 U/hr, boluses from 0-35 Units and glucose data from 39-401 mg/dl.

Basal

See note on discarded columns

Suspends & Temp Basals

Temp and Suspend basals are reported as normal basal rates meaning that they are already integrated. (In Flair for example, the suspends needed to be converted into new basal events and we had to use the temp basal rates to change standard basal rates.)

However: - Suspends are reported as NaN basal rates and therefore need to be fileld with zeros so that they are not discarded

df_basal.fillna({'Rate':0}, inplace=True)

Temporary Basals

Basal Duplicates

Some duplicates have the same time, duration, and rate and therefore are equivalent for our purposes even if they show differences in other columns.
Others share the same datetime and duration but different rate.
Others have only same time but different duration / rate.
The meaning of the extra columns (Percent, ExpectedDuration, SuprDuration…) remain unclear.
Investigating this is a big time sink and we probably won’t get it right all the time.

A few examples (same time and rate). Split by which combination of basal types exist.

When there are temporal duplicates in time and duration we could make the following assuptions:

(scheduled or temp) and suspends: prioritize the suspend, set Rate to 0 (using fillna)
scheduled and temp: prioritize temp row
only scheduled: use the maximum value

Duplicates with different durations cause more confusion. Many duplicate sets contain rows whose durations don’t match with the “next” correct basal rate while others do. In these cases i would only keep the one that matches. However, this is not necessarily always the suspend row. Here is an example: Two duplicates at 14:00 with different durations. Only one of the durations (4h) matches with the next non-duplicated row at 18:00. The other, is probably wrong.

While the above approaches might seem plausible, we can't really say that it is correct because

We don't know why there are duplicates
We don't know if we can judge the right row by the duration
We don't know if a suspend really overpowers a scheduled event

Other approaches might be even better such as using the import date etc. However, we could not find any data that would favor one over the other. Ultimately this affects only about 1% of the data.

**Therefore, we go back to a simpler method in resolving duplicates: using the row with the Record ID maximum. **

Basal Durations

We observed many very large basal durations.

Often, long durations have a zero basal rate
Mostly the durations match the data gap of 1 or more days until next basal rate
However, some overlap with the next datetime
Some are excessive e.g. >250 days: In these cases, the data might not even fall within the study period

Potentially, durations were calculated retrospectively from one to the next basal rate causing large values when there are data gaps. However, overlaps cannot be explained by this. The reasons are unclear.

It remains to the user to detect data gaps and outdated basal rates.

CGM

Calibrations need to be dropped.

Special CGM Values

Values surrounding 39md/dl values exist and appear capped as in the image below.

A value_counts() analysis confirms there are much more 39 and 401 readings than 40 and 400 respectively which represent out of range readings.

39.0     28885
40.0      4753
400.0     3530
401.0    48935

In Loop, data was capped to 40-400, in DCLP3 0 values indicated out of bound readings which were replaced with 40 and 400 respectively. To be consistent with the other datasets, we replace 39 and 401 values with 40 and 400 values knowing that in future out of bound readings might require special attention and require a different approach.

df_cgm = df_cgm['GlucoseValue'].replace({39:40, 401:400})

Boluses

Extended Boluses

Each basal row has a Normal and an Extended part. The duration is Nan when there is an extended part. Therefore, to convert it into the target format of a insulin delivery with duration, we need to split rows with both parts in two rows (as we did before in other similar datasets). Normal boluses are assigned a zero duration while extended parts use the duration of the row.

Extended Bolus Durations

Extended bolus duration is within expected limits of <8 hours.
Only extended bolus have a duration (that's good, and means there are no incorrectly-labeled Normal boluses)
Many extended boluses with short durations
- these have very small doses, many of them are 0
Around 1000 extended bolus doses are 0
- These almost always have zero duration.

It remains unclear where the zero durations come from and why we also see extremely short durations.

To avoid unnecessary rows with 0 deliveries, extended parts should be removed: df_bolus = df_bolus.replace({'Normal':0, 'Extended':0}, np.nan)

Diasend Boluses

As discussed in the section Diasend vs. Tidepool data source, Diasend boluses seem to be corrupt. At least the extended parts were almost always empty despite having a delivery duration. We resolve these by treating the Normal part as normal bolus. However, we can not guarantee that this is correct. However, if only affect ~1060 boluses (and even much less after dropping patients with incomplete data).

Requested vs. Delivered

See note on discarded columns

Collected Questions

Is our assumption about irrelevant columns correct?
How should we resolve basal duplicates:
- with equal time and durations but different rates (e.g. scheduled)
- with equal time but different duration?
- why do some have different durations? (some match next time, others don't)
- Should we use suspends over other events? And then, is NaN equal to a 0 basal rate? This is how we treated them in T1DExi
Should we exclude data before and after study start?
- Why is there such an excess of data before enrollment and after study end?
Why are there extended boluses with zero (or very small) dose?
- What are extended boluses with zero duration?
Why do almost all diasend boluses have a duration but no extended part?