Concept-Set Distribution: Single Site, Anomaly Detection, Longitudinal Analysis

PEDSnet

Data Quality Check

Created

2024-06-05

Publisher

PEDSnet

Data Requirements

cohort, domain_tbl, concept_set, omop_or_pcornet, multi_or_single_site, anomaly_or_exploratory, num_concept_combined, num_concept_1, num_concept_2, p_value, age_groups, time, time_span, time_period

Abstract

This check helps a user understand the distribution of the concepts that form or represent a particular variable in a dataset. It demonstrates how concept sets drive the prevalence or clinical composition of variables in a study. It is designed to identify anomalous data within a single site’s data over time. The results show a researcher in which time periods a given code was used abnormally frequently or infrequently in the dataset.

How to Access This Check

You may access the module’s R package on GitHub, or install it from R (this requires the devtools package):

devtools::install_github('ssdqa/conceptsetdistribution')

Using the vignettes provided on GitHub or the help pages in R, follow the parameter input instructions for the “Single Site”, “Anomaly Detection”, and “Longitudinal Analysis” requirements.

Check Output

Visualization Output

The output of this check varies with the time increment chosen by the user. For yearly time increments, a control chart highlights anomalies in the proportion of patients per concept_id for the provided variable over time. A P Prime chart is used to account for the large sample size, meaning that the standard deviation is multiplied by a numerical constant. Blue dots along the line indicate non-anomalous values, while orange dots indicate anomalies. The chart is accompanied by a concept reference table, which provides the total count of the concept in question.

With smaller time increments, such as months or weeks, seasonality can make it difficult to detect true anomalies in a time series. For these increments, the check computes anomalies while ignoring seasonality and outputs two graphs: a time series line graph with anomalies highlighted by red dots, and a four-faceted time series line graph showing the anomaly decomposition to clarify how the anomalies were identified.
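
The yearly control chart logic can be illustrated with a minimal base R sketch. It assumes the common Laney P′ formulation, in which the binomial standard deviation is scaled by a constant estimated from the moving range of the standardized proportions. The package's own implementation is not shown in this record and may differ. The data frame `yearly` and the function `laney_p_prime` are hypothetical names used only for illustration.

```r
# Hypothetical yearly counts for one concept at a single site (illustrative only)
yearly <- data.frame(
  time_start = as.Date(paste0(2013:2022, "-01-01")),
  ct_concept = c(480, 510, 495, 530, 505, 520, 490, 800, 515, 500),
  ct_denom   = c(10000, 10400, 10100, 10600, 10200, 10500, 9900, 10300, 10400, 10000)
)

# Assumed Laney P' limits: binomial sd inflated by sigma_z
laney_p_prime <- function(x, n) {
  p       <- x / n                          # observed proportion per period
  p_bar   <- sum(x) / sum(n)                # centerline
  sd_p    <- sqrt(p_bar * (1 - p_bar) / n)  # binomial standard deviation
  z       <- (p - p_bar) / sd_p             # standardized proportions
  sigma_z <- mean(abs(diff(z))) / 1.128     # moving-range estimate of the sd of z
  data.frame(
    p          = p,
    centerline = p_bar,
    lcl        = pmax(0, p_bar - 3 * sd_p * sigma_z),
    ucl        = pmin(1, p_bar + 3 * sd_p * sigma_z)
  )
}

chart <- cbind(yearly, laney_p_prime(yearly$ct_concept, yearly$ct_denom))
chart$anomaly <- chart$p < chart$lcl | chart$p > chart$ucl
print(chart[, c("time_start", "p", "lcl", "ucl", "anomaly")])
```

Here `sigma_z` plays the role of the numerical constant: with very large denominators the plain binomial limits become so narrow that nearly every point would be flagged, and the scaling widens them to reflect the observed period-to-period variation.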

Raw Output

For annual increments of analysis, the raw data output of this check contains eight columns:

Column	Data Type	Definition
`site`	character	the name of the site being targeted OR “combined” if multiple sites were provided
`time_start`	date	the start of the time period being examined
`time_increment`	character	the length of each time period
`variable`	character	the user-defined variable grouping assigned to the code
`ct_denom`	numeric	the number of rows in the domain table associated with the variable
`concept_id` / `concept_code`	numeric / character	the code of interest; for OMOP CDMs this will be `concept_id`, and for PCORnet CDMs this will be `concept_code`
`ct_concept`	numeric	the number of occurrences of the code
`prop_concept`	numeric	the proportion of variable rows with the code of interest (ct_concept / ct_denom)
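
As a rough illustration of how `ct_concept`, `ct_denom`, and `prop_concept` relate, the following base R sketch derives them from row-level data. The `records` data frame and its values are hypothetical. The sketch counts rows, as in the `ct_denom` definition above, and is not taken from the package source.

```r
# Hypothetical row-level records for one variable at one site (illustrative only)
records <- data.frame(
  time_start = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01",
                         "2021-01-01", "2021-01-01")),
  concept_id = c(101, 101, 202, 101, 202)
)
records$n <- 1L  # one row = one occurrence

# Occurrences of each code per time period
ct_concept <- aggregate(n ~ time_start + concept_id, data = records, FUN = sum)
names(ct_concept)[names(ct_concept) == "n"] <- "ct_concept"

# All rows associated with the variable per time period
ct_denom <- aggregate(n ~ time_start, data = records, FUN = sum)
names(ct_denom)[names(ct_denom) == "n"] <- "ct_denom"

# Join the two counts and compute the proportion
out <- merge(ct_concept, ct_denom, by = "time_start")
out$prop_concept <- out$ct_concept / out$ct_denom
out <- out[order(out$time_start, out$concept_id), ]
print(out)
```

Within each time period the proportions sum to 1 in this sketch, because every row carries exactly one code.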

For analyses in monthly increments or less, the raw output produces eleven columns:

Column	Data Type	Definition
`observed`	numeric	the original proportion of the concept
`season`	numeric	the seasonal component of the time series
`trend`	numeric	the trend component of the time series
`remainder`	numeric	the residual component after “season” and “trend” are removed from “observed” - target of anomaly detection
`seasadj`	numeric	the seasonally adjusted series (`observed` with the seasonal component removed)
`anomaly`	character	a flag to indicate whether the proportion is an anomaly
`anomaly_direction`	numeric	the direction of the anomaly (upper or lower)
`anomaly_score`	numeric	the distance between the anomaly and the centerline
`recomposed_l1`	numeric	the lower level bound of the processed time series used to identify lower outliers
`recomposed_l2`	numeric	the upper level bound of the processed time series used to identify upper outliers
`observed_clean`	numeric	the original proportion after the season and trend components have been removed and anomalies have been detected
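
To show how the decomposition columns relate to one another, here is a minimal sketch using base R's `stl()` on simulated monthly proportions. It is an assumption that the package's decomposition behaves like STL; the record does not name the method. The simulated series, the injected spike, and the interquartile-range rule with a multiplier of 3 are all illustrative choices rather than the package's documented settings.

```r
set.seed(42)

# Simulated monthly proportions: slow trend + yearly seasonality + noise (illustrative only)
n_months  <- 48
month_idx <- seq_len(n_months)
observed  <- 0.05 + 0.0002 * month_idx +
  0.01 * sin(2 * pi * month_idx / 12) +
  rnorm(n_months, sd = 0.001)
observed[30] <- observed[30] + 0.03  # inject one artificial spike

series <- ts(observed, start = c(2019, 1), frequency = 12)
fit    <- stl(series, s.window = "periodic", robust = TRUE)

decomp <- data.frame(
  observed  = as.numeric(series),
  season    = as.numeric(fit$time.series[, "seasonal"]),
  trend     = as.numeric(fit$time.series[, "trend"]),
  remainder = as.numeric(fit$time.series[, "remainder"])
)
# Assumed definition: seasonally adjusted series = observed minus seasonal component
decomp$seasadj <- decomp$observed - decomp$season

# Flag anomalies on the remainder with a simple IQR fence (assumed rule)
q     <- unname(quantile(decomp$remainder, c(0.25, 0.75)))
fence <- 3 * (q[2] - q[1])
decomp$anomaly <- ifelse(
  decomp$remainder < q[1] - fence | decomp$remainder > q[2] + fence,
  "Yes", "No"
)
print(decomp[decomp$anomaly == "Yes", ])
```

Because the seasonal and trend components are removed before flagging, a value that is high only because of the time of year is not treated as an anomaly, while the injected spike stands out in the remainder.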

Affiliation(s)

Children's Hospital of Philadelphia

Funder(s)

This research was made possible through the generous support of the Patient-Centered Outcomes Research Institute.

Development Code

https://github.com/ssdqa/conceptsetdistribution

Creative Commons license

Except where otherwise noted, this item's license is described as a CC-BY Attribution 4.0 License.
