Patient Records Consistency: Single Site, Anomaly Detection, Longitudinal Analysis
Abstract
This check identifies anomalous data across time at the level of a single site. The Patient Record Consistency module, part of the larger SSDQA ecosystem, tests the consistency of clinical data representation within a patient’s record. The goal is to ensure that the patient’s information is confirmatory and complete, such that two events that are expected to co-exist do both occur within the same patient (e.g., a leukemia diagnosis and chemotherapy).
How to Access This Check
- You may access the module’s R package on GitHub. Or, install it from within R (install_github is provided by the remotes and devtools packages): `install_github('ssdqa/patientrecordconsistency')`
- Using the provided vignettes on GitHub or the help pages in R, follow the parameter input instructions for the “Single-Site”, “Anomaly Detection”, and “Longitudinal Analysis” requirements.
Check Output
Visualization Output
This check’s visual output depends on the time increment input by the user.
For yearly time increments, this check outputs a control chart that highlights anomalies in the proportion of patients per event category. A P Prime chart is used to account for the large sample size, which means that the standard deviation used for the control limits is multiplied by a numerical constant. Blue dots along the line indicate non-anomalous values, while orange dots indicate anomalies. Only one event category should be specified via the `event_filter` parameter to be displayed on the graph. Any of the four options seen in the other output may be chosen: `a`, `b`, `both`, or `neither`.
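To make the role of the multiplying constant concrete, here is a minimal sketch of control limits for a P Prime chart in base R. It assumes the Laney formulation of the chart, in which the binomial standard deviation is scaled by a between-period variation factor; the package's own computation may differ. The data frame, its values, and the `p_prime_limits` helper are hypothetical; only the column names `time_start`, `stat_ct`, `total_pts`, and `prop_event` come from the raw output described on this page.

```r
# Hypothetical yearly counts for one event category (values invented for illustration)
yearly <- data.frame(
  time_start = as.Date(paste0(2013:2022, "-01-01")),
  stat_ct    = c(4100, 4250, 4180, 4300, 4210, 4320, 5900, 4350, 4280, 4400),
  total_pts  = c(10000, 10200, 10100, 10400, 10150, 10450, 10300, 10500, 10350, 10600)
)

# Laney-style P Prime limits (assumed formulation, not taken from the package source)
p_prime_limits <- function(count, n) {
  p       <- count / n                          # proportion per period
  center  <- sum(count) / sum(n)                # overall proportion (centerline)
  sigma_p <- sqrt(center * (1 - center) / n)    # binomial standard deviation per period
  z       <- (p - center) / sigma_p             # standardized proportions
  sigma_z <- mean(abs(diff(z))) / 1.128         # average moving range of z, scaled by d2 = 1.128
  lcl     <- pmax(center - 3 * sigma_p * sigma_z, 0)
  ucl     <- pmin(center + 3 * sigma_p * sigma_z, 1)
  data.frame(prop_event = p, center = center, lcl = lcl, ucl = ucl,
             anomaly = p < lcl | p > ucl)
}

limits <- cbind(yearly["time_start"],
                p_prime_limits(yearly$stat_ct, yearly$total_pts))
print(limits)
```

With very large denominators, the plain binomial limits become so narrow that almost every point would be flagged; multiplying by `sigma_z` widens them to reflect the variation actually observed between periods.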
For smaller time increments (by month or smaller), seasonality can make it difficult to detect true anomalies in a time series. For these increments, the check computes anomalies after removing the seasonal pattern and outputs two graphs:
- A time series line graph in which each anomaly is highlighted with a red dot.
- A 4-facet time series line graph that shows the decomposition of the time series, to make it clearer how the anomalies were identified.
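The decomposition idea can be sketched with base R's `stl()` function. This is an illustration under stated assumptions, not the package's implementation: the monthly series is simulated, one anomaly is injected by hand, and the rule used to flag the remainder (3 times the interquartile range) is an assumed threshold chosen for the example. The column names `observed`, `season`, `trend`, `remainder`, and `anomaly` mirror the raw output described on this page.

```r
set.seed(42)

# Simulated monthly proportions: slow trend + yearly seasonality + noise (all hypothetical)
n_months  <- 48
month_idx <- seq_len(n_months)
prop <- 0.40 + 0.0005 * month_idx +
  0.03 * sin(2 * pi * month_idx / 12) +
  rnorm(n_months, sd = 0.004)
prop[30] <- prop[30] + 0.08   # inject one artificial anomaly

# Decompose into seasonal, trend, and remainder components
series <- ts(prop, start = c(2019, 1), frequency = 12)
fit    <- stl(series, s.window = "periodic", robust = TRUE)

decomp <- data.frame(
  observed  = as.numeric(series),
  season    = as.numeric(fit$time.series[, "seasonal"]),
  trend     = as.numeric(fit$time.series[, "trend"]),
  remainder = as.numeric(fit$time.series[, "remainder"])
)

# Flag remainders far outside the interquartile range (assumed rule)
q     <- unname(quantile(decomp$remainder, c(0.25, 0.75)))
iqr   <- q[2] - q[1]
lower <- q[1] - 3 * iqr
upper <- q[2] + 3 * iqr
decomp$anomaly <- decomp$remainder < lower | decomp$remainder > upper

print(decomp[decomp$anomaly, ])
```

Because the seasonal swing is removed before the threshold is applied, a month that is high only because of the usual yearly pattern is not flagged, while the injected spike stands out in the remainder.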
Raw Output
This check produces a raw data output containing 9 columns of data for analyses over annual intervals:
Column | Data Type | Definition |
---|---|---|
site | character | the name of the site being targeted OR “combined” if multiple sites were provided |
time_start | date | the start of the time period being examined |
time_increment | character | the length of each time period |
event_a_name | character | the name of event A |
event_b_name | character | the name of event B |
total_pts | numeric | the total number of eligible patients in the cohort during the time period |
stat_type | character | string indicating the event combination of interest: A only, B only, both, or neither |
stat_ct | numeric | the count of patients meeting the criteria for stat_type in the time period of interest |
prop_event | numeric | the proportion of patients meeting the criteria for stat_type in the time period of interest |
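As a rough illustration of how the `stat_type`, `stat_ct`, and `prop_event` columns relate to one another, the sketch below classifies a handful of made-up patients for a single time period. The patient-level flags `has_a` and `has_b` are hypothetical, and the sketch assumes that `prop_event` equals `stat_ct` divided by `total_pts`; the package may derive these values differently.

```r
# Hypothetical patient-level flags for one time period
patients <- data.frame(
  patient_id = 1:8,
  has_a = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  has_b = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Assign each patient to exactly one event combination
patients$stat_type <- with(patients,
  ifelse(has_a & has_b, "both",
  ifelse(has_a, "a",
  ifelse(has_b, "b", "neither"))))
patients$stat_type <- factor(patients$stat_type,
                             levels = c("a", "b", "both", "neither"))

# Count patients per combination and convert to proportions
counts <- table(patients$stat_type)
summary_tbl <- data.frame(
  stat_type  = names(counts),
  total_pts  = nrow(patients),
  stat_ct    = as.integer(counts),
  prop_event = as.integer(counts) / nrow(patients)   # assumed: stat_ct / total_pts
)

print(summary_tbl)
```

Each patient falls into exactly one of the four combinations, so under this assumption the four `prop_event` values for a period sum to 1.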
For analyses over monthly or weekly intervals, the raw output contains 11 columns of data:
Column | Data Type | Definition |
---|---|---|
observed | numeric | the original proportion of patients |
season | numeric | the seasonal component of the time series |
trend | numeric | the trend component of the time series |
remainder | numeric | the residual component after “season” and “trend” are removed from “observed” - target of anomaly detection |
seasadj | numeric | the adjusted seasonal component |
anomaly | character | a flag to indicate whether the proportion is an anomaly |
anomaly_direction | numeric | the direction of the anomaly (upper or lower) |
anomaly_score | numeric | the distance between the anomaly and the centerline |
recomposed_l1 | numeric | the lower level bound of the processed time series used to identify lower outliers |
recomposed_l2 | numeric | the upper level bound of the processed time series used to identify upper outliers |
observed_clean | numeric | the original proportion after the season and trend components have been removed and anomalies have been detected |