Expected Variables Present: Single Site, Anomaly Detection, Longitudinal Analysis
Created
Last Modified
Domain
Category
Parameters
Publisher
Abstract
This check provides raw data and visualizations to help a user evaluate whether expected concepts are present in a dataset of interest. It summarizes the proportion of patients with co-occurring variables. This check supports the identification of anomalous data for a single site across time (years).
Data Requirements
Probe
Clinical Assessment
Access Package
# install.packages("devtools")
devtools::install_github('ssdqa/conceptsetdistribution')
Visualization Output
The output of this check varies with the time increment chosen by the user. For yearly time increments, the check outputs a control chart displaying the number of pair mappings across time. The user is limited to one concept_id or CDM code per graph. A tooltip provides each point’s exact coordinates upon hover. Anomalous visits are distinguished by orange points, while non-anomalous visits are shown as blue points.
For smaller time increments (monthly or smaller), the check outputs two graphs to visualize anomalies while ignoring seasonality. The first is a time series line graph with anomalies indicated by red dots. The second is a four-facet time series line graph that shows the decomposition of the series to clarify how each anomaly was identified. For each output, a tooltip provides each point’s exact coordinates upon hover. Both graphs represent data for one user-specified specialty at a time.
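The yearly control-chart logic described above can be illustrated with a minimal base R sketch. It assumes a simple p-chart with three-sigma limits around the pooled proportion; the chart type the package actually uses is not stated here, and the data and the 3-sigma rule are hypothetical choices for illustration only.

```r
# Hypothetical yearly counts; column names mirror the raw output of this check.
yearly <- data.frame(
  time_start     = as.Date(paste0(2015:2022, "-01-01")),
  variable_pt_ct = c(120, 130, 125, 128, 60, 131, 127, 133),
  total_pt_ct    = c(1000, 1050, 1020, 1040, 1030, 1060, 1045, 1070)
)

# Proportion of patients with evidence of the variable in each year.
yearly$prop_pt_variable <- yearly$variable_pt_ct / yearly$total_pt_ct

# Centerline: pooled proportion across all years.
p_bar <- sum(yearly$variable_pt_ct) / sum(yearly$total_pt_ct)

# Three-sigma limits (an assumption); they vary with each year's denominator.
sigma <- sqrt(p_bar * (1 - p_bar) / yearly$total_pt_ct)
yearly$lcl <- pmax(0, p_bar - 3 * sigma)
yearly$ucl <- pmin(1, p_bar + 3 * sigma)

# A year is flagged when its proportion falls outside the limits.
yearly$anomaly <- yearly$prop_pt_variable < yearly$lcl |
  yearly$prop_pt_variable > yearly$ucl

print(yearly[, c("time_start", "prop_pt_variable", "lcl", "ucl", "anomaly")])
```

In this made-up series, only the year with the sharp drop in `variable_pt_ct` falls outside the limits, which corresponds to the highlighted point on the control chart.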
Raw Output
The raw data output of this check produces ten columns of data for analysis over annual time intervals:
| Column | Data Type | Definition |
|---|---|---|
| site | character | the name of the site being targeted OR “combined” if multiple sites were provided |
| time_start | date | the start of the time period being examined |
| time_increment | character | the length of each time period |
| total_pt_ct | numeric | the total number of patients from the cohort in the domain table |
| total_row_ct | numeric | the total number of rows associated with patients from the cohort in the domain table |
| variable_pt_ct | numeric | the number of patients with evidence of the variable |
| variable_row_ct | numeric | the number of rows with evidence of the variable |
| prop_pt_variable | numeric | the proportion of patients with evidence of the variable |
| prop_row_variable | numeric | the proportion of rows with evidence of the variable |
| variable | character | the name of the variable |
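To make the relationship between the count and proportion columns concrete, here is a minimal base R sketch that builds one row of this output from a toy table. The input data frame, the `has_variable` flag, and the site and variable names are hypothetical, and the sketch assumes each proportion is the variable count divided by the corresponding total.

```r
# Hypothetical domain-table rows for one site and one time period.
domain_rows <- data.frame(
  patient_id   = c(1, 1, 2, 3, 3, 3, 4),
  has_variable = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Denominators: all cohort patients and rows in the domain table.
total_pt_ct  <- length(unique(domain_rows$patient_id))
total_row_ct <- nrow(domain_rows)

# Numerators: patients and rows with evidence of the variable.
variable_pt_ct  <- length(unique(domain_rows$patient_id[domain_rows$has_variable]))
variable_row_ct <- sum(domain_rows$has_variable)

# One output row, with columns in the same order as the table above.
summary_row <- data.frame(
  site              = "site_a",               # hypothetical site name
  time_start        = as.Date("2020-01-01"),
  time_increment    = "year",
  total_pt_ct       = total_pt_ct,
  total_row_ct      = total_row_ct,
  variable_pt_ct    = variable_pt_ct,
  variable_row_ct   = variable_row_ct,
  prop_pt_variable  = variable_pt_ct / total_pt_ct,
  prop_row_variable = variable_row_ct / total_row_ct,
  variable          = "example_variable"      # hypothetical variable name
)

print(summary_row)
```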
The raw data output of this check produces eleven columns of data for analysis in monthly or weekly time intervals:
| Column | Data Type | Definition |
|---|---|---|
| observed | numeric | the original proportion of patients/rows |
| season | numeric | the seasonal component of the time series |
| trend | numeric | the trend component of the time series |
| remainder | numeric | the residual component after “season” and “trend” are removed from “observed” - target of anomaly detection |
| seasadj | numeric | the adjusted seasonal component |
| anomaly | character | a flag to indicate whether the proportion is an anomaly |
| anomaly_direction | numeric | the direction of the anomaly (upper or lower) |
| anomaly_score | numeric | the distance between the anomaly and the centerline |
| recomposed_l1 | numeric | the lower level bound of the processed time series used to identify lower outliers |
| recomposed_l2 | numeric | the upper level bound of the processed time series used to identify upper outliers |
| observed_clean | numeric | the original proportion after the season and trend components have been removed and anomalies have been detected |
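The way several of these columns relate to one another can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: it uses `stats::stl()` for the decomposition and an interquartile-range rule with a multiplier of 3 on the remainder, and the monthly series is simulated.

```r
set.seed(42)

# Simulated monthly proportions over four years (hypothetical data).
n <- 48
month_idx <- seq_len(n)
observed <- 0.30 + 0.001 * month_idx +
  0.03 * sin(2 * pi * month_idx / 12) + rnorm(n, sd = 0.005)
observed[30] <- observed[30] + 0.15  # inject one artificial spike

# Decompose into seasonal, trend, and remainder components.
fit <- stl(ts(observed, frequency = 12), s.window = "periodic", robust = TRUE)
season    <- as.numeric(fit$time.series[, "seasonal"])
trend     <- as.numeric(fit$time.series[, "trend"])
remainder <- as.numeric(fit$time.series[, "remainder"])

# IQR rule on the remainder (the multiplier of 3 is an assumption).
q     <- unname(quantile(remainder, c(0.25, 0.75)))
lower <- q[1] - 3 * (q[2] - q[1])
upper <- q[2] + 3 * (q[2] - q[1])
anomaly <- ifelse(remainder < lower | remainder > upper, "Yes", "No")

# Add season and trend back to the remainder limits to get bounds
# on the scale of the observed series.
recomposed_l1 <- season + trend + lower
recomposed_l2 <- season + trend + upper

result <- data.frame(observed, season, trend, remainder,
                     anomaly, recomposed_l1, recomposed_l2)
print(result[result$anomaly == "Yes", ])
```

Because the anomaly rule is applied to `remainder` rather than to `observed`, a regular seasonal peak is not flagged on its own; a point is flagged only when it departs from what the season and trend components already explain.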

