# Import packages

# For data manipulation
import numpy as np
import pandas as pd

# For working with datetime objects
from datetime import datetime

# For working with SQLite databases
import sqlite3

Abstract

Establishing a reliable and efficient process for managing continuous glucose monitor (CGM) data keeps the dataset accurate, consistent, and manageable. Using Python and SQLite, new data is cleaned, validated, added to the database, and prepared for visualizations in Tableau.

Key Points

  • Data Storage: Only the most recent 90 days of CGM data are stored in the SQLite database, optimizing storage and focusing on relevant data.
  • Data Validation: Duplicate entries are removed and missing dates are identified so the dataset remains complete.
  • Efficient Integration: New data is appended to the existing database without overwriting or redundancy.
  • Prepared for Visualization: Cleaned and validated data is ready for use in visualization tools like Tableau.

Introduction

Cleaning and organizing continuous glucose monitor (CGM) data is essential for accurate analysis and meaningful visualizations. Building on the SQLite database created earlier, this part emphasizes how to maintain clean datasets for ongoing analysis and visualization.

Why Use SQLite?

Storing cleaned and formatted CGM data in an SQLite database simplifies critical operations such as:

  • Reviewing existing records.
  • Adding new, validated data.
  • Checking for missing dates.
  • Deleting duplicate entries.

New data is appended to the existing database instead of replacing the table, ensuring no loss of information.

# Add the new data to the existing CLARITY_DATA table
df0.to_sql('CLARITY_DATA', con=sqlite3.connect('ClarityHistory.db'), if_exists='append', index=False)
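
A small refinement worth considering: keeping a reference to the connection makes it possible to close it explicitly once the append finishes, rather than leaving that to garbage collection. A minimal sketch of the same append:

# Same append, with the connection closed explicitly afterward
conn = sqlite3.connect('ClarityHistory.db')
df0.to_sql('CLARITY_DATA', con=conn, if_exists='append', index=False)
conn.close()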

Storage Strategy

The initial plan to store all data in the SQLite database has evolved into a more efficient approach:

  • Focus on Recent Data: Only the most recent 90 days of data are retained, as visualizations focus on this timeframe.
  • Efficient Management: As new data is added, older entries are removed from the database.
  • Minimized Local Storage: Older data remains on the source server, enabling access without increasing local storage demands.
  • The following SQL query implements the deletion of rows older than 90 days (a plain-Python version of the same cleanup is sketched after this list):
    %%sql
    -- Delete rows in CLARITY_DATA that are older than 90 days
    DELETE FROM CLARITY_DATA
    WHERE Date < (SELECT date(MAX(Date), '-89 day') FROM CLARITY_DATA);
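
The %%sql cell above runs inside a notebook with the SQL magic loaded. The same cleanup can also be run from plain Python using the sqlite3 module imported earlier; a minimal sketch, assuming the ClarityHistory.db file and CLARITY_DATA table used throughout:

# Delete rows older than the most recent 90 days using plain sqlite3
conn = sqlite3.connect('ClarityHistory.db')
conn.execute("""
    DELETE FROM CLARITY_DATA
    WHERE Date < (SELECT date(MAX(Date), '-89 day') FROM CLARITY_DATA);
""")
conn.commit()
conn.close()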

Key Python Insights

Python and pandas make it efficient to analyze, validate, and summarize the CGM dataset. Here are the key operations and their outputs.

  • Counting Days in the Dataset:  Identifying the number of unique days is as simple as calling nunique() on the Date column.
    # Count the number of unique days in the dataset
    unique_days1 = df1['Date'].nunique()
  • Finding Days with At Least 144 Readings: Identifying the number of days with at least 144 readings (approximately half a day's worth of CGM data) is also simple.
    # Count the number of unique days in the dataset that have at least 144 readings
    days_with_144_readings = df1['Date'].value_counts().ge(144).sum()
    • .value_counts() counts occurrences of each unique date.
    • .ge(144) identifies days with 144 or more readings.
    • .sum() tallies these days.
  • Displaying the results
    # Display the results
    print(f"Days in Dataset: {unique_days1}")
    print(f"Days With at Least 144 Readings: {days_with_144_readings}")

Sample Output

Days in Dataset: 90
Days With at Least 144 Readings: 88

Why Check for Days with At Least 144 Readings?

This step is a data validation measure akin to identifying outliers in a dataset. In continuous glucose monitor (CGM) data from a Dexcom device, a full day typically consists of 288 readings (one every 5 minutes). A day with significantly fewer readings may indicate sensor, software, or other issues that should be noted without excluding the data.

  • Finding Days with Fewer Than 144 Readings
    # Get a count of readings per day
    readings_per_day = df1['Date'].value_counts()
    
    # Identify the days with fewer than 144 readings
    days_below_144_readings = readings_per_day[readings_per_day < 144]
    
    # Sort the days by date in ascending order
    days_below_144_readings = days_below_144_readings.sort_index()
    
    # Format the output
    num_days_below_144 = len(days_below_144_readings)
    formatted_output = days_below_144_readings.reset_index()
    
    # Display the results
    print(f"Days with Fewer Than 144 Readings: {num_days_below_144}")
    for date, count in formatted_output.values:
        print(f"{date}: {count}")

Sample Output

Days with Fewer Than 144 Readings: 2
2024-10-14: 57
2024-10-15: 100

Database Management

These database operations use SQLite to keep the data manageable (sample queries for the duplicate and missing-date checks are sketched after this list):

  • Appending new data.
  • Identifying duplicate rows.
  • Deleting duplicate rows.
  • Checking for missing dates.
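
The queries below are a minimal sketch of the duplicate and missing-date checks, not the exact code from the program. The column names Date, Time, and Glucose are assumptions; substitute the actual columns in CLARITY_DATA.

%%sql
-- Report duplicate rows: identical Date, Time, and Glucose values
SELECT Date, Time, Glucose, COUNT(*) AS occurrences
FROM CLARITY_DATA
GROUP BY Date, Time, Glucose
HAVING COUNT(*) > 1;

%%sql
-- Delete duplicates, keeping the first copy of each row (by rowid)
DELETE FROM CLARITY_DATA
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM CLARITY_DATA
    GROUP BY Date, Time, Glucose
);

Checking for missing dates is convenient in pandas once the table has been read back into a DataFrame (df1 here, as in the validation steps above):

# Build the full range of expected dates and compare it to the dates present
dates = pd.to_datetime(df1['Date'])
expected = pd.date_range(dates.min(), dates.max(), freq='D')
missing_dates = expected.difference(pd.DatetimeIndex(dates.unique()))
print(f"Missing Dates: {len(missing_dates)}")
for d in missing_dates:
    print(d.date())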

Summary

Establishing a reliable and efficient process for cleaning, validating, and integrating new data into an SQLite database ensures the dataset remains manageable and meets analytical needs, providing a solid foundation for accurate analysis and meaningful insights.

Update: Streamlining File Handling in the Python Program

I've made an update to the Python program to eliminate two manual steps: renaming the downloaded file to remove date information and copying it to the program’s working directory. The updated code now searches for the latest Clarity export file in the downloads folder and automatically selects it. If no matching file is found, the program stops instead of proceeding and generating an error.

To enable this functionality, I've added glob, os, and sys to the import section. Here’s how the update works:

# Import packages

# For data manipulation
import numpy as np
import pandas as pd

# For working with datetime objects
from datetime import datetime

# For working with SQLite databases
import sqlite3

# For working with files
import glob
import os

# For working with system commands
import sys

# Find the latest Clarity data file

# Define the directory and filename pattern
directory = r"C:\Users\clock\Downloads"
pattern = "Clarity_Export_Smith_Jeffrey_*.csv"

# Construct the full search pattern
search_pattern = os.path.join(directory, pattern)

# Find all matching files
matching_files = glob.glob(search_pattern)

# Ensure there is at least one match
if not matching_files:
    sys.exit("No matching files found. Exiting program.")

# Sort files by modification time to get the most recent one
latest_file = max(matching_files, key=os.path.getmtime)

# Extract just the file name
file_name = os.path.basename(latest_file)

print(f"Found file: {file_name}")
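
From here the rest of the program runs unchanged, just using the selected file instead of a hard-coded name. A minimal sketch, assuming the export loads with pandas defaults (the Clarity-specific cleanup from the earlier parts of this series would follow):

# Load the selected export into the working DataFrame
df0 = pd.read_csv(latest_file)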

Python Code at GitHub

Continue to Working with CGM Data: Part 4 – Visualizing The Data in Tableau

Frequently Asked Questions

What happens if my CGM dataset has missing timestamps or overlapping entries?
Missing data is an inherent aspect of working with CGM datasets. Gaps may occur during firmware updates or when transitioning to a new CGM sensor after the 15-day usage limit. These gaps are normal and should be displayed as-is, as interpolation risks misrepresenting the data and introducing inaccuracies. Tools like Dexcom Clarity highlight these gaps, while Tableau connects points across missing data with a straight line. Additionally, in regions that observe seasonal time changes, such as the U.S., a one-hour gap will occur during the switch from standard time to daylight saving time. These gaps should be documented to ensure accurate interpretation of the dataset.

Overlapping data occurs during the transition from daylight saving time to standard time in regions like the U.S. During this change, the 1:00 AM hour repeats, resulting in duplicate timestamps. Dexcom Clarity visualizes this by displaying two lines for the overlapping hour. Tableau, in contrast, requires all data points to be aggregated (e.g., using sum, average, minimum, or maximum). For instance, setting Tableau to average duplicate readings ensures a single value is displayed for each timestamp. When working with overlapping data, it is recommended to leave it in the dataset, as it reflects actual conditions during the seasonal time change.
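
For readers who want to see what that averaging looks like outside Tableau, here is a minimal pandas sketch; the column names Timestamp and Glucose are assumptions, and the result is for inspection only, since the recommendation above is to leave the duplicates in the stored dataset:

# Average duplicate timestamps into a single reading per timestamp
# (mirrors Tableau's "average" aggregation; does not modify df1)
averaged = df1.groupby('Timestamp', as_index=False)['Glucose'].mean()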

Why did you choose SQLite over other database systems for this project?
SQLite is lightweight, portable, and easy to integrate with Python. It requires no additional setup, making it ideal for small to medium datasets, such as CGM data, that grow incrementally over time.

What is the maximum dataset size SQLite can handle efficiently?
SQLite performs well with datasets up to several gigabytes. For datasets exceeding this size, consider moving to a database system such as PostgreSQL or MySQL.

What file formats does your cleaning process support?
The process is designed for CSV files but can be adapted for other formats like Excel or JSON by modifying the data import function (e.g., using pd.read_excel() or pd.read_json()).

How can I adapt these cleaning steps for CGM datasets from different manufacturers?
Understand the data format provided by the manufacturer (e.g., Abbott, Dexcom). Look for fields analogous to timestamps, glucose readings, and device-generated metadata. Adjust column mappings and validation checks to match.