Recommendations for Better Datasets

We have spent numerous hours analyzing and understanding clinical diabetes study datasets. In the process, we repeatedly came across similar challenges and unknowns. Often we could not resolve them and had to make small or large assumptions. This page summarizes these challenges and suggests improvements. It is meant as a guide for researchers processing these datasets, and for investigators who want to improve the quality and utility of the datasets they publish.

Access

Access is often difficult and time-consuming
Approval processes are cumbersome
Data should be click-to-download ready

File Structure and Glossary

Filenames and column names often don't align with the glossary.
One file per data type is preferred.
DCLP3 is particularly problematic: 5 CGM files from different sources with inadequate descriptions.
Avoid large files (e.g. Loop, T1DEXI)
Large files often require out-of-core computation because they do not fit into memory
Split files by patient
Use Parquet over CSV (more efficient and preserves data types)
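
When a dataset ships as one large file, it can still be split by patient without loading it into memory. The following is a minimal sketch using only the Python standard library. The column name `patient_id` and the demo rows are assumptions for illustration, not taken from any of the datasets above.

```python
import csv
import io
import tempfile
from pathlib import Path


def split_csv_by_patient(source, out_dir, id_column="patient_id"):
    """Stream rows from `source` and write each one to a per-patient CSV file."""
    out_dir = Path(out_dir)
    handles, writers = {}, {}
    reader = csv.DictReader(source)
    try:
        for row in reader:  # one row at a time, so memory use stays flat
            pid = row[id_column]
            if pid not in writers:
                handles[pid] = open(out_dir / f"{pid}.csv", "w", newline="")
                writers[pid] = csv.DictWriter(handles[pid], fieldnames=reader.fieldnames)
                writers[pid].writeheader()
            writers[pid].writerow(row)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)


# Tiny stand-in for a large export file (hypothetical columns).
demo = io.StringIO(
    "patient_id,datetime,cgm\n"
    "p1,2023-01-01T00:00:00,110\n"
    "p2,2023-01-01T00:00:00,95\n"
    "p1,2023-01-01T00:05:00,112\n"
)
with tempfile.TemporaryDirectory() as tmp:
    patients = split_csv_by_patient(demo, tmp)
    p1_lines = (Path(tmp) / "p1.csv").read_text().splitlines()

print(patients)
```

The sketch keeps one open file handle per patient, which is fine for a few hundred patients but would need batching for very large cohorts.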

Redundant Data

Multiple columns or files may contain the same information (e.g. 6 CGM files with no added value)
This is confusing: which one is correct? Are all of them needed?
Documentation should clearly specify:
Which files were used
How they were generated
Who created them (device or investigator)

Datetimes

Prefer Local Datetime:
Ideally, local times
If UTC with offset is used, provide the locale to allow accounting for Daylight Saving Time (DST) and travel-related shifts.
Device Resets & Time Shifts:
Do not send both "corrected timestamps" and original timestamps. Only the correct one should be shipped.
Ideally, device time resets are already integrated. If you provide them in addition, explain whether the datetimes already account for them.
Consistent Format:
Follow ISO 8601 for all datetime formats.
Avoid mixed formats (e.g., date/time separate, relative to study start).
Timestamps relative to study start can compromise analyses that depend on calendar time, such as weekend or seasonal effects.
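
To illustrate why local time matters, here is a small sketch that converts a UTC timestamp with a per-row offset into local time. The function name `to_local` and the example row are hypothetical. A fixed offset like this cannot represent DST transitions on its own, which is why the locale is still needed.

```python
from datetime import datetime, timedelta, timezone


def to_local(utc_iso, offset_minutes):
    """Convert an ISO 8601 UTC timestamp to local time using a fixed UTC offset."""
    utc = datetime.fromisoformat(utc_iso).replace(tzinfo=timezone.utc)
    return utc.astimezone(timezone(timedelta(minutes=offset_minutes)))


# Hypothetical row: timestamp stored in UTC, offset of -5 hours.
utc_value = "2023-03-11T02:30:00"
local = to_local(utc_value, -300)

print(local.isoformat())  # 2023-03-10T21:30:00-05:00
# weekday(): Monday is 0, so 4 is Friday and 5 is Saturday.
print(datetime.fromisoformat(utc_value).weekday(), local.weekday())  # 5 4
```

In this example the UTC date falls on a Saturday while the local date is still Friday, so a weekday/weekend analysis on the UTC values would assign the event to the wrong group.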

Data Consistency

Avoid using different units in the same column:
- Replace BG: bolus durations are given in ms or minutes depending on the data source (which resides in a different table).
- Sometimes glucose values are expressed in both or either of mg/dl and mmol/l
Ensure all unique column values are described (look at the unique values and discuss each one, including NaNs).
- Avoid using different labels for the same meaning e.g. dual vs. combination boluses
- We've seen values that were not documented or missing. This is problematic because filtering by one criterion will miss other rows.
  - Flair: besides the CLOSED_LOOP_MICRO_BOLUS bolus type, there are also rows CL_MICRO_BOLUS
  - T1DEXI: combination vs. dual, standard vs. normal etc.
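
On the processing side, we find it useful to normalize units and labels explicitly and to surface anything that is not covered. The sketch below assumes a hypothetical mapping table and unit names (`ms`, `min`). The raw labels are the ones mentioned above, while the canonical names are our own choice for illustration.

```python
from collections import Counter

# Hypothetical mapping from raw labels to one canonical label per meaning.
CANONICAL_BOLUS_TYPE = {
    "standard": "normal",
    "normal": "normal",
    "dual": "combination",
    "combination": "combination",
    "closed_loop_micro_bolus": "micro",
    "cl_micro_bolus": "micro",
}


def normalize_bolus_types(labels):
    """Return canonical labels plus a count of raw labels missing from the mapping."""
    normalized, unknown = [], Counter()
    for raw in labels:
        key = raw.strip().lower() if isinstance(raw, str) else raw
        if key in CANONICAL_BOLUS_TYPE:
            normalized.append(CANONICAL_BOLUS_TYPE[key])
        else:
            normalized.append(None)  # keep the row, but mark it as unresolved
            unknown[raw] += 1
    return normalized, unknown


def duration_to_minutes(value, unit):
    """Convert a duration to minutes; fail loudly on undocumented units."""
    if unit == "ms":
        return value / 60000
    if unit == "min":
        return value
    raise ValueError(f"undocumented duration unit: {unit!r}")


labels = ["CLOSED_LOOP_MICRO_BOLUS", "CL_MICRO_BOLUS", "Dual", "square", None]
normalized, unknown = normalize_bolus_types(labels)
print(normalized)
print(dict(unknown))
```

Counting the unmapped labels instead of silently dropping them makes undocumented values visible early.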

Special Values & NaNs

Zero-unit deliveries are unnecessary to report for boluses. If possible, these should already have been removed. For basal rates, however, 0 marks a basal suspend and should be kept.
We often observed NaN values that carried a meaning. For example, when a suspend event occurred, the basal rate was NaN although delivery was in fact paused, so the rate should have been 0.
Clearly document the meaning of 0, NaN, and any other magic numbers.
Magic CGM Numbers such as 0/39/401 for outside/below/above measurement range should be clearly documented.
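
Once magic numbers are documented, they can be separated from real measurements before any statistics are computed. This is a minimal sketch that assumes the 0/39/401 coding mentioned above. The flag names and the function `decode_cgm` are hypothetical, and every dataset needs its own documented mapping.

```python
# Assumed sentinel codes; each dataset needs its own documented mapping.
CGM_SENTINELS = {0: "outside_range", 39: "below_range", 401: "above_range"}


def decode_cgm(value):
    """Split a raw CGM value into (glucose_mg_dl, flag)."""
    if value is None or value != value:  # None or NaN (NaN is not equal to itself)
        return None, "missing"
    if value in CGM_SENTINELS:
        return None, CGM_SENTINELS[value]
    return float(value), "valid"


readings = [120, 39, 401, 0, float("nan")]
decoded = [decode_cgm(v) for v in readings]
for raw, (glucose, flag) in zip(readings, decoded):
    print(raw, glucose, flag)
```

Whether below/above-range readings should become missing values or be clamped is an analysis decision. The sketch only keeps the two pieces of information apart.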

Duplicates

Many datasets contain duplicates
Complete duplicates might be dropped but often the doses differ.
Often these are temporal duplicates (same patient and time) where other values differ:
- Sometimes the basal rate is the same but other columns differ: which information should we trust?
- Sometimes the doses differ by a fraction of the value
- Sometimes the doses are completely different
Sometimes the same value is reported twice within close temporal distance

Examples:

Same extended bolus reported with 1 hour distance
Different basal rates at the same time (0 vs. 3, 2 vs. 3)
Almost identical basal rates at the same time but one is rounded: 1.25 vs. 1.3
Same value reported with different duration formats: 01:00 vs. 1:00 for basal duration
Same MDI injection reported twice within a few seconds: Duplicate or split bolus? (T1DEXI)

We expect this to be a result of:

merging different data sources
the device reporting an updated entry later on

The risks with duplicates are:

duplicates are overlooked and counted twice
the wrong duplicate is dropped (at random)
or worse: the wrong duplicate is always kept (e.g. by using the maximum value/id, the latest entry, ...)
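
Before dropping anything, it helps to separate exact duplicates from temporal duplicates whose values conflict. The following sketch groups rows by patient and time. The column names and the demo rows are assumptions for illustration.

```python
from collections import defaultdict


def classify_duplicates(rows, key=("patient_id", "datetime")):
    """Group rows by key and separate exact duplicates from conflicting ones."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key)].append(row)

    exact, conflicting = {}, {}
    for group_key, members in groups.items():
        if len(members) < 2:
            continue  # unique row, nothing to resolve
        distinct = {tuple(sorted(m.items())) for m in members}
        (exact if len(distinct) == 1 else conflicting)[group_key] = members
    return exact, conflicting


t = "2023-01-01T08:00:00"
rows = [
    {"patient_id": "p1", "datetime": t, "basal_rate": 1.25},
    {"patient_id": "p1", "datetime": t, "basal_rate": 1.3},   # rounded copy
    {"patient_id": "p2", "datetime": t, "basal_rate": 0.8},
    {"patient_id": "p2", "datetime": t, "basal_rate": 0.8},   # exact copy
    {"patient_id": "p3", "datetime": t, "basal_rate": 1.0},
]
exact, conflicting = classify_duplicates(rows)
print(list(exact))
print(list(conflicting))
```

Exact duplicates can usually be dropped safely. Conflicting groups need a documented rule or manual review, which is the case the risks above describe.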

CGM, Basal and Boluses

CGM

Magic CGM numbers (see above), e.g. 38 and other values outside the measurement range of 40-400 mg/dl
Consistent handling of below/above range readings
Below/Above Ranges
- Some data is clamped (Replace BG is limited to 39/401 mg/dl), while other data is encoded as below/above range
Report Calibrations (separately)
Reduce the number of files or describe the differences precisely. Which one should be used? Often we saw multiple files (e.g. Dexcom raw CGM, Clarity CGM, CGM from pump, ...) without documentation of how they differ or which one was, and should be, used for analysis.

Basal

Ensure all basal changes are reported and that these reflect
temporary basal rates
pump suspends
closed loop modes
During closed loop mode, are basal events still reported?
- We've seen cases where the pump reported zeros while others kept reporting the profile
If information about temporary basal rate changes is provided, this should be additional. The rate should be correct and trusted without having to be modified.
Duplicates: Ensure there are none because these are extremely difficult to investigate (see Duplicates)
Basal Gaps should be marked
- We've seen data gaps result in very long basal durations (sometimes >250 days in Replace BG), which seem to be caused by retrospective processing. This makes it difficult to determine whether a basal rate was really active for more than a day or whether data was missing.
- Gaps in basal rate should be marked: Otherwise we don't know if the data is missing or no basal happened (see Data Gaps)

Example: In Flair, we had to manually add 0 events at suspends and adjust the rate for temporary basal rates. Further, during closed loop modes basal rates were still reported and had to be shut off.
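
When gaps are not marked, one pragmatic check is to flag basal rates that appear to stay active for an implausibly long time. The sketch below assumes basal events as `(start, rate)` pairs and a threshold of 24 hours. Both the data layout and the threshold are assumptions that depend on the pump and should be adapted.

```python
from datetime import datetime, timedelta


def flag_long_basal_segments(events, max_duration=timedelta(hours=24)):
    """Return (start, next_start, rate) where a rate lasts longer than max_duration."""
    ordered = sorted(events, key=lambda event: event[0])
    flagged = []
    # Each rate is assumed to stay active until the next basal event.
    for (start, rate), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - start > max_duration:
            flagged.append((start, next_start, rate))
    return flagged


# Hypothetical basal events for one patient: (timestamp, rate in U/h).
events = [
    (datetime(2023, 1, 1, 0, 0), 0.8),
    (datetime(2023, 1, 1, 6, 0), 1.0),
    (datetime(2023, 9, 15, 6, 0), 0.9),
]
flagged = flag_long_basal_segments(events)
for start, end, rate in flagged:
    print(start.isoformat(), end.isoformat(), rate, (end - start).days)
```

A flagged segment is only a candidate. Without a gap marker in the data, it remains unclear whether the rate was active or the data was missing.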

Boluses

Extended Boluses

We've seen many variants: extended boluses reported at completion or at start; the normal and extended parts reported as two rows or in separate columns; totals that didn't match the sum of the normal and extended parts; and different labels for the extended part (see Data Consistency).

Therefore, make sure to:

Document whether the timestamp refers to delivery start or completion
Make sure that the duration reflects boluses that were stopped early
Provide the delivery duration (in DCLP5 this was missing)
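
A simple consistency check for the mismatch between total and parts can be run before publishing or analyzing the data. This sketch assumes one row per bolus with hypothetical fields `total`, `normal`, and `extended` (all in units) and an arbitrary tolerance of 0.01 U.

```python
import math


def find_inconsistent_boluses(rows, tolerance=0.01):
    """Return rows whose total differs from normal + extended by more than tolerance."""
    return [
        row
        for row in rows
        if not math.isclose(row["total"], row["normal"] + row["extended"], abs_tol=tolerance)
    ]


# Hypothetical bolus rows.
boluses = [
    {"id": 1, "total": 5.0, "normal": 2.0, "extended": 3.0},
    {"id": 2, "total": 5.0, "normal": 2.0, "extended": 2.5},  # parts do not add up
    {"id": 3, "total": 3.3, "normal": 1.1, "extended": 2.2},  # equal within rounding
]
inconsistent = find_inconsistent_boluses(boluses)
print([row["id"] for row in inconsistent])  # [2]
```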

Data Gaps

It is often unclear whether data gaps are due to missing uploads, disconnected devices, or other technical issues. It is then also unclear whether the pump was still running and, for example, whether we should trust the last basal event. This is especially difficult when some data types exist during a gap (e.g. CGM) while others are missing (e.g. basal rates).

A watchdog signal could help.
Document the minimum reporting frequency of pump events (e.g. does a pump report at least one basal rate per day?)
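
In the absence of a watchdog signal, one rough proxy is to compare which calendar days are covered by each data type. The sketch below lists days that have CGM records but no basal events. The function names and timestamps are hypothetical, and a day without basal events is not necessarily a gap if the pump only reports changes.

```python
from datetime import datetime


def days_covered(timestamps):
    """Return the set of calendar days on which at least one record exists."""
    return {datetime.fromisoformat(ts).date() for ts in timestamps}


def days_without(reference, other):
    """Days present in `reference` (e.g. CGM) but absent from `other` (e.g. basal)."""
    return sorted(days_covered(reference) - days_covered(other))


# Hypothetical ISO 8601 timestamps for one patient.
cgm_times = ["2023-01-01T00:00:00", "2023-01-02T00:00:00", "2023-01-03T00:00:00"]
basal_times = ["2023-01-01T06:00:00", "2023-01-03T06:00:00"]

suspicious_days = days_without(cgm_times, basal_times)
print([day.isoformat() for day in suspicious_days])  # ['2023-01-02']
```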

Data Outside Study Period

Excessive or missing data often appears before or after the defined study period
Should be removed or correctly labeled
A simple table could help to better interpret, trim and select the data:
Study period start and end
Device / intervention
device active/inactive
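
With such a table, labeling or trimming records becomes a simple lookup. This is a minimal sketch. The table layout, the patient id, and the dates are hypothetical.

```python
from datetime import datetime

# Hypothetical study-period table: patient -> (start, end), both inclusive.
STUDY_PERIODS = {
    "p1": (datetime(2023, 1, 10), datetime(2023, 4, 10)),
}


def label_study_period(patient_id, timestamp, periods=STUDY_PERIODS):
    """Label a record as 'before', 'within', or 'after' the patient's study period."""
    start, end = periods[patient_id]
    if timestamp < start:
        return "before"
    if timestamp > end:
        return "after"
    return "within"


labels = [
    label_study_period("p1", datetime(2023, 1, 5)),
    label_study_period("p1", datetime(2023, 2, 1)),
    label_study_period("p1", datetime(2023, 5, 1)),
]
print(labels)  # ['before', 'within', 'after']
```

Labeling instead of deleting keeps the run-in and follow-up data available for anyone who needs it.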

Validation measures (TDD)

Providing daily patient TDD split by basal and bolus is very helpful
In Flair, this allowed us to spot wrong assumptions by comparing the TDDs of our extracted data with the true TDDs
Make sure there is only one value per patient and day (we've seen it all)
Document how Basal TDD was calculated
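
As one possible way to compute a daily TDD for comparison, the sketch below integrates basal rates over time, splits segments at midnight, and adds the boluses of each day. The segment layout `(start, end, rate in U/h)` and the demo values are assumptions. How a given dataset's reported TDD was calculated still has to come from its documentation.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def basal_tdd(segments):
    """Sum basal insulin per calendar day from (start, end, rate_u_per_h) segments."""
    totals = defaultdict(float)
    for start, end, rate in segments:
        cursor = start
        while cursor < end:
            # Stop at the next midnight or at the segment end, whichever is first.
            midnight = datetime.combine(cursor.date() + timedelta(days=1), datetime.min.time())
            stop = min(end, midnight)
            totals[cursor.date()] += rate * (stop - cursor).total_seconds() / 3600
            cursor = stop
    return dict(totals)


def bolus_tdd(boluses):
    """Sum bolus units per calendar day from (timestamp, units) pairs."""
    totals = defaultdict(float)
    for timestamp, units in boluses:
        totals[timestamp.date()] += units
    return dict(totals)


# Hypothetical data: one basal segment crossing midnight and two boluses.
segments = [(datetime(2023, 1, 1, 22), datetime(2023, 1, 2, 2), 1.0)]
boluses = [(datetime(2023, 1, 1, 12), 4.0), (datetime(2023, 1, 1, 18), 3.5)]

basal = basal_tdd(segments)
bolus = bolus_tdd(boluses)
day1 = datetime(2023, 1, 1).date()
computed_tdd = basal.get(day1, 0.0) + bolus.get(day1, 0.0)
print(computed_tdd)  # 9.5
```

Comparing `computed_tdd` against the reported value per patient and day quickly reveals wrong assumptions about suspends, temporary basal rates, or duplicates.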