Expected Variables Present: Single Site, Anomaly Detection, Longitudinal Analysis


Created

Last Modified

Domain

Category

Parameters

Publisher

PEDSnet

Abstract

This check provides raw data and visualizations to help a user evaluate whether expected concepts are present in a dataset of interest. It summarizes the proportion of patients with co-occurring variables. This check promotes the identification of anomalous data for a single site across time (years).

Probe

Clinical Assessment

Access Package

# install.packages("devtools")
devtools::install_github('ssdqa/conceptsetdistribution')

Visualization Output

This check's output varies based on the time increment supplied by the user.

For yearly time increments, the check outputs a control chart displaying the number of pair mappings across time. The user is limited to one concept_id or CDM code per graph. A tooltip provides each point’s exact coordinates upon hover. Anomalous visits are distinguished by orange points, while non-anomalous visits are shown as blue points.

For smaller time increments (monthly or smaller), the check outputs two graphs to visualize anomalies while ignoring seasonality. The first is a time series line graph with anomalies indicated by red dots. The second is a four-facet time series line graph that shows the decomposition of the anomalies to clarify how each anomaly was identified. For each output, a tooltip provides each point’s exact coordinates upon hover. Both graphs represent data for one user-specified specialty at a time.
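As a rough illustration of how a yearly control chart can separate anomalous from non-anomalous points, the sketch below computes three-sigma limits for a proportion chart (p-chart). The source does not state which type of control chart the check draws or how its limits are set, so the chart type, the `p_chart` function, the `sigmas` parameter, and the sample counts are all assumptions made for illustration. It is written in Python and is not the package's R implementation.

```python
import math


def p_chart(event_counts, totals, sigmas=3.0):
    """Return the centerline and per-period limits for a p-chart.

    Assumption: each period contributes a count of patients with the
    variable (event_counts) out of a cohort total (totals).
    """
    center = sum(event_counts) / sum(totals)
    points = []
    for count, total in zip(event_counts, totals):
        proportion = count / total
        # Binomial standard error around the centerline for this period.
        half_width = sigmas * math.sqrt(center * (1 - center) / total)
        lower = max(0.0, center - half_width)
        upper = min(1.0, center + half_width)
        points.append({
            "proportion": proportion,
            "lower": lower,
            "upper": upper,
            "anomaly": proportion < lower or proportion > upper,
        })
    return center, points


# Hypothetical yearly counts for one variable at one site.
center, points = p_chart([20, 22, 18, 21, 19, 40], [100] * 6)
flags = [point["anomaly"] for point in points]
```

With these made-up counts, only the final year falls outside the limits, which is the kind of point the chart would color differently.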


Raw Output

The raw data output of this check produces ten columns of data for analysis over annual time intervals:

| Column | Data Type | Definition |
| --- | --- | --- |
| site | character | the name of the site being targeted OR “combined” if multiple sites were provided |
| time_start | date | the start of the time period being examined |
| time_increment | character | the length of each time period |
| total_pt_ct | numeric | the total number of patients from the cohort in the domain table |
| total_row_ct | numeric | the total number of rows associated with patients from the cohort in the domain table |
| variable_pt_ct | numeric | the number of patients with evidence of the variable |
| variable_row_ct | numeric | the number of rows with evidence of the variable |
| prop_pt_variable | numeric | the proportion of patients with evidence of the variable |
| prop_row_variable | numeric | the proportion of rows with evidence of the variable |
| variable | character | the name of the variable |
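The relationship between the count and proportion columns above can be sketched as follows. This assumes each proportion is simply the variable count divided by the matching cohort total, which the column definitions suggest but do not state outright. The `AnnualRow` class and the sample values are hypothetical, and the sketch is in Python rather than the package's R.

```python
from dataclasses import dataclass


@dataclass
class AnnualRow:
    """One row of the annual raw output (field names follow the table)."""
    site: str
    time_start: str
    time_increment: str
    total_pt_ct: int
    total_row_ct: int
    variable_pt_ct: int
    variable_row_ct: int
    variable: str

    @property
    def prop_pt_variable(self) -> float:
        # Assumed definition: patients with the variable / all cohort patients.
        return self.variable_pt_ct / self.total_pt_ct if self.total_pt_ct else 0.0

    @property
    def prop_row_variable(self) -> float:
        # Assumed definition: rows with the variable / all cohort rows.
        return self.variable_row_ct / self.total_row_ct if self.total_row_ct else 0.0


# Hypothetical values for a single site, year, and variable.
row = AnnualRow(
    site="site_a",
    time_start="2020-01-01",
    time_increment="year",
    total_pt_ct=200,
    total_row_ct=1000,
    variable_pt_ct=50,
    variable_row_ct=125,
    variable="hypothetical_variable",
)
```

Here `row.prop_pt_variable` is 50 / 200 and `row.prop_row_variable` is 125 / 1000.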

The raw data output of this check produces eleven columns of data for analysis in monthly or weekly time intervals:

| Column | Data Type | Definition |
| --- | --- | --- |
| observed | numeric | the original proportion of patients/rows |
| season | numeric | the seasonal component of the time series |
| trend | numeric | the trend component of the time series |
| remainder | numeric | the residual component after “season” and “trend” are removed from “observed” - target of anomaly detection |
| seasadj | numeric | the adjusted seasonal component |
| anomaly | character | a flag to indicate whether the proportion is an anomaly |
| anomaly_direction | numeric | the direction of the anomaly (upper or lower) |
| anomaly_score | numeric | the distance between the anomaly and the centerline |
| recomposed_l1 | numeric | the lower level bound of the processed time series used to identify lower outliers |
| recomposed_l2 | numeric | the upper level bound of the processed time series used to identify upper outliers |
| observed_clean | numeric | the original proportion after the season and trend components have been removed and anomalies have been detected |
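To make the relationship among several of these columns concrete, the sketch below derives `remainder` by subtracting `season` and `trend` from `observed`, then flags points whose remainder falls outside fixed limits. Several parts are assumptions not confirmed by the source: the remainder limits are passed in as hypothetical inputs (the source does not say how they are estimated), `recomposed_l1` and `recomposed_l2` are taken to be those limits added back onto `season` and `trend`, the direction is coded as -1, 0, or 1, and the flag as "Yes" or "No". The `flag_anomalies` function is illustrative Python, not the package's R implementation.

```python
def flag_anomalies(observed, season, trend, lower, upper):
    """Build output-style rows from an already decomposed series.

    lower/upper are hypothetical limits on the remainder component.
    """
    rows = []
    for obs, seas, trnd in zip(observed, season, trend):
        # Residual left once season and trend are removed from observed.
        remainder = obs - seas - trnd
        if remainder > upper:
            direction = 1
        elif remainder < lower:
            direction = -1
        else:
            direction = 0
        rows.append({
            "observed": obs,
            "season": seas,
            "trend": trnd,
            "remainder": remainder,
            "anomaly": "Yes" if direction != 0 else "No",
            "anomaly_direction": direction,
            # Assumed recomposition: remainder limits mapped back to the
            # scale of the observed series.
            "recomposed_l1": seas + trnd + lower,
            "recomposed_l2": seas + trnd + upper,
        })
    return rows


# Hypothetical monthly proportions and components.
rows = flag_anomalies(
    observed=[0.5, 0.75, 0.25],
    season=[0.0, 0.125, -0.125],
    trend=[0.5, 0.5, 0.5],
    lower=-0.0625,
    upper=0.0625,
)
directions = [r["anomaly_direction"] for r in rows]
```

Under these assumptions a point is anomalous exactly when `observed` lies outside the interval from `recomposed_l1` to `recomposed_l2`, which matches how the table describes the two bounds.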

Funder(s)

This research was made possible through the generous support of the Patient-Centered Outcomes Research Institute (PCORI). The statements presented in this work are solely the responsibility of the author(s) and do not necessarily represent the views of PCORI, its Board of Governors, or its Methodology Committee.

Provenance

Description

Clinical Subjects Headings

Related Data Quality Result

Expected Variables Present Study Results III: PAQS Query 3
Created: 2025-05-30
Affiliation: PEDSnet Data Coordinating Center
The results of an Expected Variables Present check using the Single Site, Anomaly Detection, Longitudinal parameters. This check evaluates the annual distributions of key variables related to diabetes: stroke, second-line antidiabetics, ketoacidosis, an HbA1c > 8%, elevated blood pressure, and CKD.

Related Person

Related Code

Study-Specific Quality, Utility, and Breadth Assessment
Created: 2025-11
Affiliation: PEDSnet Data Coordinating Center
This suite of R packages allows users to investigate multiple facets of data quality and customize analyses to their study-specific needs. Each module allows up to 8 different analyses in either the OMOP or PCORnet CDM, each taking a different view of the data while still addressing the same data quality probe.

##### [View pkgdown summary here.](https://ssdqa.github.io/squba/)

Related Data Quality Check

Related Publications

Creative Commons license

Except where otherwise noted, this item's license is described as a CC-BY Attribution 4.0 License.

Cite this Data Quality Check

PEDSnet Data Coordinating Center. (2024, June). Expected Variables Present: Single Site, Anomaly Detection, Longitudinal Analysis. [D Q Check]. PEDSpace Knowledge Bank. https://doi.org/10.24373/pdsp-467