Duplicate Records: Single Site, Anomaly Detection, Cross-Sectional Analysis
Domain
Category
Parameters
Publisher
PEDSnet
Abstract
This check provides raw data and visualizations to help a user evaluate whether duplicate records are present in a dataset of interest. It summarizes the proportion of rows that are duplicates and the proportion of patients with duplicate rows, as well as the median number of duplicate rows per patient.
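The three summary measures named above can be sketched in base R on a toy table. This is only an illustration, not the package's implementation: the data frame, its column names, the choice of `dup_cols`, and the decision to count every copy of a repeated row as a duplicate are all assumptions made for this example.

```r
# Toy data with hypothetical column names (not the package's required schema)
visits <- data.frame(
  patient_id = c(1, 1, 1, 2, 2, 3),
  visit_date = c("2024-01-05", "2024-01-05", "2024-02-10",
                 "2024-03-01", "2024-03-01", "2024-04-15"),
  visit_type = c("outpatient", "outpatient", "inpatient",
                 "outpatient", "outpatient", "outpatient")
)

# Columns used to define duplication (an assumption for this sketch)
dup_cols <- c("patient_id", "visit_date", "visit_type")

# Flag every row that shares its key columns with at least one other row.
# Assumption: all copies count as duplicates, not only the extra ones.
is_dup <- duplicated(visits[dup_cols]) |
  duplicated(visits[dup_cols], fromLast = TRUE)

# Proportion of rows that are duplicates
prop_dup_rows <- mean(is_dup)

# Number of duplicate rows per patient (0 for patients without duplication)
dup_per_patient <- tapply(is_dup, visits$patient_id, sum)

# Proportion of patients with at least one duplicate row
prop_patients_dup <- mean(dup_per_patient > 0)

# Median number of duplicate rows among patients who have any
median_dup_rows <- median(dup_per_patient[dup_per_patient > 0])
```

Changing `dup_cols` changes what counts as a duplicate, which is why the raw output records the definition used alongside the counts.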
Data Requirements
Probe
Clinical Assessment
Access Package
# install.packages("devtools")
devtools::install_github('ssdqa/duplicaterecords')
Visualization Output
This check outputs a bar graph displaying either the proportion or the number of patients whose count of duplicate rows falls a user-selected number of standard deviations away from the mean.
Raw Output
This check produces a raw data output containing 14 columns:
| Column | Data Type | Definition |
|---|---|---|
| site | character | the name of the site being targeted |
| duplicate_definition | character | an alias to describe the definition of duplication being investigated |
| duplicate_columns | character | the name(s) of the column(s) included or excluded to define duplication |
| n_w_fact | numeric | the total number of patients with evidence of duplication |
| sd_fact | numeric | the standard deviation of the number of duplicate values per patient, only including patients who have evidence of duplication |
| mean_fact | numeric | the mean of the number of duplicate values per patient, only including patients who have evidence of duplication |
| outlier_fact | numeric | the number of patients, only including patients who have evidence of duplication, who fall a user-selected number of standard deviations away from the mean |
| prop_outlier_fact | numeric | the proportion of patients who fall a user-selected number of standard deviations away from the mean out of patients who have evidence of duplication |
| n_tot | numeric | the total number of patients |
| sd_tot | numeric | the standard deviation of the number of duplicate rows per patient for all patients |
| mean_tot | numeric | the mean of the number of duplicate rows per patient for all patients |
| outlier_tot | numeric | the number of patients, out of all patients, who fall a user-selected number of standard deviations away from the mean |
| prop_outlier_tot | numeric | the proportion of patients who fall a user-selected number of standard deviations away from the mean out of all patients |
| output_function | character | a string indicating the type of visualization that should be generated by dr_output |
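To make the difference between the `_fact` and `_tot` columns concrete, the sketch below computes both sets of statistics from a vector of per-patient duplicate-row counts. It is a hedged illustration rather than the package's code: the counts, the helper `summarise_outliers`, the threshold `n_sd`, and the two-sided reading of "away from the mean" are all assumptions.

```r
# Hypothetical per-patient duplicate-row counts (0 = no evidence of duplication)
dup_counts <- c(0, 0, 0, 0, 2, 2, 3, 2, 12, 0)

# User-selected number of standard deviations (assumed value for this sketch)
n_sd <- 2

# Hypothetical helper: mean, SD, and outlier counts for a vector of counts.
# Assumption: a patient is an outlier when the count lies more than
# n_sd standard deviations from the mean in either direction.
summarise_outliers <- function(counts, n_sd) {
  m <- mean(counts)
  s <- sd(counts)
  is_outlier <- abs(counts - m) > n_sd * s
  list(
    n = length(counts),
    mean = m,
    sd = s,
    outlier = sum(is_outlier),
    prop_outlier = mean(is_outlier)
  )
}

# "_tot" statistics: every patient, including those with zero duplicates
tot <- summarise_outliers(dup_counts, n_sd)

# "_fact" statistics: only patients with evidence of duplication
fact <- summarise_outliers(dup_counts[dup_counts > 0], n_sd)
```

Because patients without duplicates pull the mean toward zero, the same patient can be an outlier in the all-patient view and not in the duplication-only view, so the two sets of columns are worth reading side by side.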
Affiliation(s)
Funder(s)
This research was made possible through the generous support of the Patient-Centered Outcomes Research Institute (PCORI). The statements presented in this work are solely the responsibility of the author(s) and do not necessarily represent the views of PCORI, its Board of Governors, or its Methodology Committee.
Creative Commons license
Except where otherwise noted, this item's license is described as a CC-BY Attribution 4.0 License.
Cite this Data Quality Check
Wieand, K. & Razzaghi, H. (2026, March). Duplicate Records: Single Site, Anomaly Detection, Cross-Sectional Analysis. [D Q Check]. PEDSpace Knowledge Bank. https://doi.org/10.24373/pdsp-709

