Duplicate Records: Single Site, Anomaly Detection, Cross-Sectional Analysis


Created

Last Modified

Domain

Category

Parameters

Publisher

PEDSnet

Abstract

This check provides raw data and visualizations to help a user evaluate whether duplicate records are present in a dataset of interest. It summarizes the proportion of duplicate rows and the proportion of patients with duplicate rows, as well as the median number of duplicate rows per patient.
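
As a rough illustration of the three summary measures named above, the following Python sketch computes them for a small in-memory table. It is not the package's R implementation. The function name, the `person_id` column, and the rule that every row in a group of identical key values counts as a duplicate row are assumptions made for this example, as is taking the median only over patients who have at least one duplicate row.

```python
from collections import Counter
from statistics import median


def summarize_duplicates(rows, key_columns, patient_column="person_id"):
    """Summarize duplication under one hypothetical duplicate definition.

    rows: list of dicts, one per record.
    key_columns: columns whose combined values define a duplicate
        (assumed to include the patient column).
    """
    keys = [tuple(row[c] for c in key_columns) for row in rows]
    group_sizes = Counter(keys)

    # Assumption: every row in a group of size > 1 is a duplicate row.
    dup_rows_per_patient = Counter()
    for row, key in zip(rows, keys):
        if group_sizes[key] > 1:
            dup_rows_per_patient[row[patient_column]] += 1

    patients = {row[patient_column] for row in rows}
    n_dup_rows = sum(dup_rows_per_patient.values())
    return {
        "prop_duplicate_rows": n_dup_rows / len(rows),
        "prop_patients_with_duplicates": len(dup_rows_per_patient) / len(patients),
        # Assumption: median over patients with at least one duplicate row.
        "median_duplicate_rows_per_patient": (
            median(dup_rows_per_patient.values()) if dup_rows_per_patient else 0
        ),
    }


# Hypothetical example data: patient 1 has one repeated row pair,
# patient 2 has no repeats, patient 3 has the same row three times.
example_rows = [
    {"person_id": 1, "visit_date": "2024-01-01", "code": "A"},
    {"person_id": 1, "visit_date": "2024-01-01", "code": "A"},
    {"person_id": 1, "visit_date": "2024-02-01", "code": "B"},
    {"person_id": 2, "visit_date": "2024-01-05", "code": "A"},
    {"person_id": 3, "visit_date": "2024-03-01", "code": "C"},
    {"person_id": 3, "visit_date": "2024-03-01", "code": "C"},
    {"person_id": 3, "visit_date": "2024-03-01", "code": "C"},
]
example_summary = summarize_duplicates(
    example_rows, key_columns=["person_id", "visit_date", "code"]
)
```

Changing `key_columns` changes the duplicate definition, which mirrors the idea of including or excluding columns when defining duplication.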

Probe

Clinical Assessment

Access Package

# install.packages("devtools")
devtools::install_github('ssdqa/duplicaterecords')

Visualization Output

This check outputs a bar graph displaying either the proportion or the number of patients whose count of duplicate rows falls a user-selected number of standard deviations away from the mean.

Raw Output

This check produces a raw data output containing 14 columns:

| Column | Data Type | Definition |
| --- | --- | --- |
| site | character | the name of the site being targeted |
| duplicate_definition | character | an alias to describe the definition of duplication being investigated |
| duplicate_columns | character | the name(s) of the column(s) included or excluded to define duplication |
| n_w_fact | numeric | the total number of patients with evidence of duplication |
| sd_fact | numeric | the standard deviation of the number of duplicate values per patient, only including patients who have evidence of duplication |
| mean_fact | numeric | the mean of the number of duplicate values per patient, only including patients who have evidence of duplication |
| outlier_fact | numeric | the number of patients, only including patients who have evidence of duplication, who fall a user-selected number of standard deviations away from the mean |
| prop_outlier_fact | numeric | the proportion of patients who fall a user-selected number of standard deviations away from the mean out of patients who have evidence of duplication |
| n_tot | numeric | the total number of patients |
| sd_tot | numeric | the standard deviation of the number of duplicate rows per patient for all patients |
| mean_tot | numeric | the mean of the number of duplicate rows per patient for all patients |
| outlier_tot | numeric | the number of patients, out of all patients, who fall a user-selected number of standard deviations away from the mean |
| prop_outlier_tot | numeric | the proportion of patients who fall a user-selected number of standard deviations away from the mean out of all patients |
| output_function | character | a string indicating the type of visualization that should be generated by dr_output |
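
To make the numeric columns concrete, the sketch below derives the `_fact` and `_tot` values from a mapping of patients to duplicate-row counts. It is a minimal Python illustration under stated assumptions, not the package's own code: the function names are hypothetical, the standard deviation is the sample standard deviation, and a patient is treated as an outlier when their count lies strictly more than `n_sd` standard deviations from the mean in either direction.

```python
from statistics import mean, stdev


def _summarize(values, n_sd):
    """Return (n, sd, mean, outlier count, outlier proportion) for one group."""
    if not values:
        return 0, 0.0, 0.0, 0, 0.0
    m = mean(values)
    # Assumption: sample standard deviation; undefined for one value, so use 0.
    sd = stdev(values) if len(values) > 1 else 0.0
    # Assumption: two-sided rule, strictly beyond n_sd standard deviations.
    outliers = sum(1 for v in values if abs(v - m) > n_sd * sd)
    return len(values), sd, m, outliers, outliers / len(values)


def anomaly_summary(dup_counts, n_sd=2):
    """dup_counts maps patient id -> number of duplicate rows (0 if none)."""
    all_values = list(dup_counts.values())
    fact_values = [v for v in all_values if v > 0]  # patients with duplication

    n_f, sd_f, mean_f, out_f, prop_f = _summarize(fact_values, n_sd)
    n_t, sd_t, mean_t, out_t, prop_t = _summarize(all_values, n_sd)
    return {
        "n_w_fact": n_f, "sd_fact": sd_f, "mean_fact": mean_f,
        "outlier_fact": out_f, "prop_outlier_fact": prop_f,
        "n_tot": n_t, "sd_tot": sd_t, "mean_tot": mean_t,
        "outlier_tot": out_t, "prop_outlier_tot": prop_t,
    }


# Hypothetical counts: five patients with duplicates, three without.
example_counts = {"p1": 2, "p2": 2, "p3": 2, "p4": 2, "p5": 12,
                  "p6": 0, "p7": 0, "p8": 0}
example_result = anomaly_summary(example_counts, n_sd=1)
```

The two groups differ only in their denominator: the `_fact` statistics drop patients with zero duplicate rows, while the `_tot` statistics keep them, which lowers the mean and changes which patients sit beyond the threshold.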

Funder(s)

This research was made possible through the generous support of the Patient-Centered Outcomes Research Institute (PCORI). The statements presented in this work are solely the responsibility of the author(s) and do not necessarily represent the views of PCORI, its Board of Governors, or its Methodology Committee.

Provenance

Description

Clinical Subjects Headings

Related Data Quality Result

Related Person

Related Code

Study-Specific Quality, Utility, and Breadth Assessment
Created: 2025-11
Affiliation: PEDSnet Data Coordinating Center
This suite of R packages lets users investigate multiple facets of data quality and customize analyses to their study-specific needs. Each module supports up to 8 different analyses in either the OMOP or PCORnet CDM, each taking a different view of the data while still addressing the same data quality probe.

##### [View pkgdown summary here.](https://ssdqa.github.io/squba/)

Related Data Quality Check

Related Publications

Creative Commons license

Except where otherwise noted, this item's license is described as a CC-BY Attribution 4.0 License.

Cite this Data Quality Check

Wieand, K. & Razzaghi, H. (2026, March). Duplicate Records: Single Site, Anomaly Detection, Cross-Sectional Analysis. [D Q Check]. PEDSpace Knowledge Bank. https://doi.org/10.24373/pdsp-709