Concept-Set Distribution: Single Site, Anomaly Detection, Longitudinal Analysis
Abstract
This check helps a user understand the distribution of the concepts that form or represent a particular variable in a dataset. It demonstrates how concept sets drive the prevalence or clinical composition of variables in a study. It is designed to identify anomalous data within a single site’s data over time, and it tells a researcher in which time periods a given code was used abnormally frequently or infrequently in the dataset.
How to Access This Check
- You may access the module’s R package on GitHub at https://github.com/ssdqa/conceptsetdistribution.
Or, install it from R:
devtools::install_github('ssdqa/conceptsetdistribution')
- Using the provided vignettes on GitHub or the help pages in R, follow the parameter input instructions for the “Single Site”, “Anomaly Detection”, and “Longitudinal Analysis” requirements.
Check Output
Visualization Output
The output of this check varies with the time increment chosen by the user. For yearly time increments, a control chart highlights anomalies in the proportion of patients per concept_id for the provided variable over time. A P Prime chart is used to account for the large sample size, which means that the standard deviation is multiplied by a numerical constant. Blue dots along the line indicate non-anomalous values, while orange dots indicate anomalies. The chart is accompanied by a concept reference table that provides the total count of the concept in question.
When using smaller time increments, such as months or weeks, seasonality can make it difficult to detect true anomalies in a time series. For these increments, the check computes anomalies while ignoring seasonality and outputs 2 graphs: a time series line graph with each anomaly highlighted by a red dot, and a four-faceted time series line graph showing the anomaly decomposition to clarify how the anomalies were identified.
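The yearly control chart logic can be sketched in base R. This is a minimal illustration that assumes the Laney P′ formulation, in which the binomial standard deviation is scaled by a constant estimated from the moving range of standardized proportions; the package's own implementation may differ. The function name `p_prime_limits`, the data frame `yearly`, and all counts are hypothetical.

```r
# Made-up yearly counts: patients with the concept (x) out of all patients
# with the variable (n). Year 2017 contains an injected spike.
yearly <- data.frame(
  year = 2011:2022,
  x    = c(500, 510, 495, 505, 490, 515, 900, 500, 505, 495, 510, 500),
  n    = rep(10000, 12)
)

# Hypothetical helper: Laney P' control limits (k standard deviations wide).
p_prime_limits <- function(x, n, k = 3) {
  p       <- x / n                           # observed proportion per period
  p_bar   <- sum(x) / sum(n)                 # centerline
  sd_p    <- sqrt(p_bar * (1 - p_bar) / n)   # within-period (binomial) sd
  z       <- (p - p_bar) / sd_p              # standardized proportions
  sigma_z <- mean(abs(diff(z))) / 1.128      # scaling constant from the moving range
  lcl     <- p_bar - k * sd_p * sigma_z
  ucl     <- p_bar + k * sd_p * sigma_z
  data.frame(
    p          = p,
    centerline = p_bar,
    lcl        = pmax(0, lcl),               # proportions cannot fall below 0
    ucl        = pmin(1, ucl),               # or rise above 1
    anomaly    = p < lcl | p > ucl
  )
}

res <- cbind(year = yearly$year, p_prime_limits(yearly$x, yearly$n))
print(res[res$anomaly, c("year", "p", "lcl", "ucl")])
```

With these made-up counts, only the 2017 proportion falls outside the limits. Because `sigma_z` is estimated from the data, a short series with one very large spike can widen the limits enough to hide that spike, so the sketch uses twelve periods.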
Raw Output
The raw data output of this check produces eight columns of data for annual increments of analysis:
Column | Data Type | Definition |
---|---|---|
site | character | the name of the site being targeted OR “combined” if multiple sites were provided |
time_start | date | the start of the time period being examined |
time_increment | character | the length of each time period |
variable | character | the user-defined variable grouping assigned to the code |
ct_denom | numeric | the number of rows in the domain table associated with the variable |
concept_id / concept_code | numeric / character | the code of interest; for OMOP CDMs this will be concept_id, for PCORnet CDMs this will be concept_code |
ct_concept | numeric | the number of occurrences of the code |
prop_concept | numeric | the proportion of variable rows with the code of interest (ct_concept / ct_denom) |
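To make the relationship between ct_concept, ct_denom, and prop_concept concrete, here is a small base R sketch on made-up rows. The data frame `rows`, the variable name, and the concept IDs 1001 and 1002 are placeholders, and the sketch assumes that both counts are computed within each time period; it is not the package's own code.

```r
# Hypothetical domain-table rows: one row per recorded occurrence of a code.
rows <- data.frame(
  time_start = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01",
                         "2021-01-01", "2021-01-01")),
  variable   = "example_variable",
  concept_id = c(1001, 1001, 1002, 1001, 1002)   # placeholder codes
)
rows$one <- 1   # helper column so that summing counts rows

# ct_concept: occurrences of each code within each period
ct <- aggregate(list(ct_concept = rows$one),
                by = rows[c("time_start", "variable", "concept_id")],
                FUN = sum)

# ct_denom: all rows associated with the variable within each period
denom <- aggregate(list(ct_denom = rows$one),
                   by = rows[c("time_start", "variable")],
                   FUN = sum)

# prop_concept = ct_concept / ct_denom
out <- merge(ct, denom, by = c("time_start", "variable"))
out$prop_concept <- out$ct_concept / out$ct_denom
print(out)
```

In this toy example the proportions within each period sum to 1, because every row of the variable carries exactly one of the listed codes.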

For analyses in monthly increments or less, the raw output produces eleven columns:
Column | Data Type | Definition |
---|---|---|
observed | numeric | the original proportion of the concept |
season | numeric | the seasonal component of the time series |
trend | numeric | the trend component of the time series |
remainder | numeric | the residual component after “season” and “trend” are removed from “observed” - target of anomaly detection |
seasadj | numeric | the adjusted seasonal component |
anomaly | character | a flag to indicate whether the proportion is an anomaly |
anomaly_direction | numeric | the direction of the anomaly (upper or lower) |
anomaly_score | numeric | the distance between the anomaly and the centerline |
recomposed_l1 | numeric | the lower level bound of the processed time series used to identify lower outliers |
recomposed_l2 | numeric | the upper level bound of the processed time series used to identify upper outliers |
observed_clean | numeric | the original proportion after the season and trend components have been removed and anomalies have been detected |
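The decomposition columns above can be illustrated with a short base R sketch. It uses `stats::stl()` and a 3 × IQR fence on the remainder as stand-ins; the decomposition method, the fence width, and the simulated series are all assumptions made for illustration, and the package may rely on a different routine. The variable names mirror the column names only loosely.

```r
set.seed(42)

# Simulated monthly proportions: mild trend, yearly seasonality, small noise.
n_months <- 60
t        <- seq_len(n_months)
observed <- 0.05 + 0.0002 * t + 0.01 * sin(2 * pi * t / 12) +
  rnorm(n_months, sd = 0.001)
observed[30] <- observed[30] + 0.02          # injected spike at month 30

series <- ts(observed, start = c(2018, 1), frequency = 12)

# Seasonal-trend decomposition; robust fitting limits the spike's influence.
fit       <- stl(series, s.window = "periodic", robust = TRUE)
season    <- as.numeric(fit$time.series[, "seasonal"])
trend     <- as.numeric(fit$time.series[, "trend"])
remainder <- as.numeric(fit$time.series[, "remainder"])

# Anomaly detection targets the remainder: flag values outside the IQR fences.
q     <- unname(quantile(remainder, c(0.25, 0.75)))
fence <- 3 * (q[2] - q[1])                   # assumed fence width
lower <- q[1] - fence
upper <- q[2] + fence
anomaly <- ifelse(remainder < lower | remainder > upper, "Yes", "No")

# Bounds mapped back onto the original scale of the series.
recomposed_l1 <- season + trend + lower
recomposed_l2 <- season + trend + upper

decomposed <- data.frame(observed, season, trend, remainder, anomaly,
                         recomposed_l1, recomposed_l2)
print(decomposed[decomposed$anomaly == "Yes", ])
```

The three components always add back up to the observed series, which is why bounds computed on the remainder can be shifted by `season + trend` to give bounds on the original scale.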