Concept-Set Distribution: Single Site, Anomaly Detection, Longitudinal Analysis

Publisher

PEDSnet

Abstract

This check helps a user understand the distribution of the concepts that form or represent a particular variable in a dataset. It demonstrates how concept sets drive the prevalence or clinical composition of variables in a study. It is designed to identify anomalous data within a single site’s data over time, and it shows a researcher the time periods in which a given code was used abnormally frequently or infrequently in the dataset.


How to Access This Check

  1. You may access the module’s R package on GitHub at https://github.com/ssdqa/conceptsetdistribution.
    Or, run in R:
devtools::install_github('ssdqa/conceptsetdistribution')
  2. Using the provided vignettes on GitHub or the help pages in R, follow the parameter input instructions for the “Single Site”, “Anomaly Detection”, “Longitudinal Analysis” requirements.

Check Output

Visualization Output

The output of this check varies with the time increment input by the user.

For yearly time increments, a control chart highlights anomalies in the proportion of patients per concept_id for the provided variable over time. A P Prime chart is used to account for the large sample size, meaning that the standard deviation used for the control limits has been multiplied by a numerical constant. Blue dots along the line indicate non-anomalous values, while orange dots are anomalies. The chart is accompanied by a concept reference table, which provides the total count of the concept in question.

For smaller time increments, such as months or weeks, seasonality can make it difficult to detect true anomalies in a time series. In this case the check computes anomalies while ignoring seasonality and outputs two graphs: a time series line graph with anomalies highlighted by red dots, and a four-faceted time series line graph showing the anomaly decomposition to clarify how the anomalies were identified.
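To make the control-limit idea concrete, the sketch below computes limits in base R under the assumption that "P Prime" refers to Laney's p′ chart, in which the usual binomial standard deviation is scaled by a factor estimated from the moving range of the standardized proportions. The function name `laney_p_prime` and the counts are hypothetical and for illustration only; the package's own implementation may differ.

```r
# Hypothetical yearly counts for a single concept (illustrative values only)
ct_concept <- c(120, 135, 128, 150, 131, 126)
ct_denom   <- c(1000, 1100, 1050, 1080, 1020, 990)

# Sketch of Laney p-prime limits (assumed method, not taken from the package)
laney_p_prime <- function(x, n) {
  p       <- x / n                          # observed proportion per period
  p_bar   <- sum(x) / sum(n)                # centerline: overall proportion
  sigma_p <- sqrt(p_bar * (1 - p_bar) / n)  # binomial standard deviation
  z       <- (p - p_bar) / sigma_p          # standardized proportions
  sigma_z <- mean(abs(diff(z))) / 1.128     # scaling constant from moving range
  data.frame(
    prop       = p,
    centerline = p_bar,
    lcl        = pmax(0, p_bar - 3 * sigma_p * sigma_z),
    ucl        = pmin(1, p_bar + 3 * sigma_p * sigma_z)
  )
}

res <- laney_p_prime(ct_concept, ct_denom)
# A period is flagged when its proportion falls outside the scaled limits
res$anomaly <- res$prop < res$lcl | res$prop > res$ucl
print(res)
```

Because the scaling constant is estimated from period-to-period variation, the limits reflect the variation actually seen in the series and do not shrink toward the centerline as the denominators grow.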

Raw Output

The raw data output of this check contains eight columns for annual increments of analysis:

Column                    Data Type           Definition
------------------------- ------------------- ----------------------------------------------------------------------------------------------------------------
site                      character           the name of the site being targeted OR “combined” if multiple sites were provided
time_start                date                the start of the time period being examined
time_increment            character           the length of each time period
variable                  character           the user-defined variable grouping assigned to the code
ct_denom                  numeric             the number of rows in the domain table associated with the variable
concept_id / concept_code numeric / character the code of interest; for OMOP CDMs this will be concept_id, for PCORnet CDMs this will be concept_code
ct_concept                numeric             the number of occurrences of the code
prop_concept              numeric             the proportion of variable rows with the code of interest (ct_concept / ct_denom)
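
As a small worked example of the last column, the sketch below builds two hypothetical rows in base R and derives prop_concept from ct_concept and ct_denom. The variable name, concept_id, and counts are invented placeholders, not real output from the check.

```r
# Hypothetical rows shaped like the annual raw output (values are invented)
raw <- data.frame(
  site           = "site_a",
  time_start     = as.Date(c("2019-01-01", "2020-01-01")),
  time_increment = "year",
  variable       = "example_variable",
  ct_denom       = c(2000, 2500),
  concept_id     = 1234567,          # placeholder code, not a real concept
  ct_concept     = c(500, 500)
)

# prop_concept is the share of the variable's rows that carry this code
raw$prop_concept <- raw$ct_concept / raw$ct_denom
print(raw[, c("time_start", "ct_concept", "ct_denom", "prop_concept")])
```

Here the count of the code is unchanged between years, but the proportion falls from 0.25 to 0.20 because the denominator grew, which is why the check tracks proportions and not raw counts.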

For analyses in monthly increments or less, the raw output produces eleven columns:
Column            Data Type Definition
----------------- --------- ----------------------------------------------------------------------------------------------------------------
observed          numeric   the original proportion of the concept
season            numeric   the seasonal component of the time series
trend             numeric   the trend component of the time series
remainder         numeric   the residual component after “season” and “trend” are removed from “observed” - target of anomaly detection
seasadj           numeric   the adjusted seasonal component
anomaly           character a flag to indicate whether the proportion is an anomaly
anomaly_direction numeric   the direction of the anomaly (upper or lower)
anomaly_score     numeric   the distance between the anomaly and the centerline
recomposed_l1     numeric   the lower level bound of the processed time series used to identify lower outliers
recomposed_l2     numeric   the upper level bound of the processed time series used to identify upper outliers
observed_clean    numeric   the original proportion after the season and trend components have been removed and anomalies have been detected
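
The relationship between observed, season, trend, and remainder can be illustrated with base R. The sketch below is a stand-in and not the package's method: it assumes an STL decomposition (`stats::stl`) and a simple interquartile-range rule on the remainder, applied to a simulated monthly series with one injected spike. The data, the multiplier of 3, and the "Yes"/"No" flag values are all assumptions made for illustration.

```r
set.seed(42)

# Simulated monthly proportions: mild trend + yearly seasonality + noise
n_months <- 48
t_idx    <- seq_len(n_months)
observed <- 0.20 + 0.0005 * t_idx + 0.02 * sin(2 * pi * t_idx / 12) +
  rnorm(n_months, sd = 0.003)
observed[30] <- observed[30] + 0.05   # injected spike to be detected

# Decompose into seasonal, trend, and remainder components
series <- ts(observed, frequency = 12, start = c(2018, 1))
fit    <- stl(series, s.window = "periodic", robust = TRUE)

season    <- as.numeric(fit$time.series[, "seasonal"])
trend     <- as.numeric(fit$time.series[, "trend"])
remainder <- as.numeric(fit$time.series[, "remainder"])

# Flag remainders far outside the interquartile range (assumed rule)
q     <- unname(quantile(remainder, c(0.25, 0.75)))
lower <- q[1] - 3 * (q[2] - q[1])
upper <- q[2] + 3 * (q[2] - q[1])

decomp <- data.frame(
  observed, season, trend, remainder,
  anomaly = ifelse(remainder < lower | remainder > upper, "Yes", "No")
)
print(decomp[decomp$anomaly == "Yes", ])
```

Because the seasonal and trend components are removed before the bounds are applied, a regular yearly peak is not flagged on its own, while a value that departs from the expected pattern is.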

Funder(s)

This research was made possible through the generous support of the Patient-Centered Outcomes Research Institute.

Creative Commons license

Except where otherwise noted, this item's license is described as a CC-BY Attribution 4.0 License.