Duplicate Records: Multi Site, Anomaly Detection, Cross-Sectional Analysis


dc.contributorPatient-Centered Outcomes Research Institute
dc.contributor.authorWieand, Kaleigh
dc.contributor.authorRazzaghi, Hanieh
dc.contributor.otherPEDSnet Data Coordinating Center
dc.date.accessioned2026-05-18T14:14:38Z
dc.date.created2026-03-27
dc.description.abstractThis check provides raw data and visualizations to help a user evaluate whether duplicate records are present in a dataset of interest. It summarizes the proportion of duplicate rows and of patients with duplicate rows, as well as the median number of duplicate rows per patient.
dc.identifier.urihttps://hdl.handle.net/20.500.14642/1654
dc.identifier.urihttps://doi.org/10.24373/pdsp-713
dc.publisherPEDSnet
dc.relation.urihttps://github.com/ssdqa/duplicaterecords
dc.rightsa CC-BY Attribution 4.0 License.
dc.rights.urihttp://creativecommons.org/licenses/by/4.0
dc.subjectEvent-Level Analysis
dc.subjectMulti-Site Analysis
dc.subjectData Anomaly Method
dc.subjectCross-Sectional Analysis
dc.titleDuplicate Records: Multi Site, Anomaly Detection, Cross-Sectional Analysis
dspace.entity.typeDQCheck
local.code.package# install.packages("devtools")
devtools::install_github('ssdqa/duplicaterecords')
local.description.rawThis check produces a raw data output containing 27 columns: <br>

|Column |Data Type|Definition |
|-------|---------|-----------|
|`site` |character|the name of the site being targeted |
|`duplicate_definition` |character|an alias to describe the definition of duplication being investigated |
|`duplicate_columns` |character|the name(s) of the column(s) included or excluded to define duplication |
|`total_rows` |numeric|the total number of rows in the domain |
|`total_pt` |numeric|the total number of patients in the domain |
|`duplicate_rows` |numeric|the number of duplicate rows |
|`duplicate_pt` |numeric|the number of patients with at least one duplicate row |
|`duplicate_row_prop` |numeric|the proportion of duplicate rows |
|`duplicate_pt_prop` |numeric|the proportion of patients with at least one duplicate row |
|`median_all_with0s` |numeric|the median number of duplicate rows per patient, for all patients, across all sites |
|`median_all_without0s` |numeric|the median number of duplicate rows per patient, for only patients with evidence of duplication, across all sites |
|`median_site_with0s` |numeric|the median number of duplicate rows per patient, for all patients, for a specific site |
|`median_site_without0s` |numeric|the median number of duplicate rows per patient, for only patients with evidence of duplication, for a specific site |
|`mean_val` |numeric|the mean proportion of patients or rows (based on user selection) for each group across sites |
|`median_val` |numeric|the median proportion of patients or rows (based on user selection) for each group across sites |
|`sd_val` |numeric|the standard deviation of the proportion of patients or rows (based on user selection) for each group across sites |
|`mad_val` |numeric|the median absolute deviation of the proportion of patients or rows (based on user selection) for each group across sites |
|`cov_val` |numeric|the coefficient of variation of the proportion of patients or rows (based on user selection) for each group across sites |
|`max_val` |numeric|the maximum proportion of patients or rows (based on user selection) for each group across sites |
|`min_val` |numeric|the minimum proportion of patients or rows (based on user selection) for each group across sites |
|`range_val` |numeric|the range of the proportion of patients or rows (based on user selection) for each group across sites |
|`total_ct` |numeric|the total number of group members |
|`analysis_eligible` |character|a string indicating whether the group is eligible for anomaly detection analysis |
|`lower_tail` |numeric|the lower bound used to identify low anomalies |
|`upper_tail` |numeric|the upper bound used to identify high anomalies |
|`anomaly_yn` |character|a string indicating whether the value is anomalous or not |
|`output_function` |character|a string indicating the type of visualization that should be generated by dr_output |

{.dqcheck-table}
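To make the count, proportion, and median columns concrete, the base R sketch below derives them from a toy event table. It is an illustration only, not the package's implementation: the input columns (`person_id`, `code`, `date`), the helper `summarise_site`, and the rule that a row counts as a duplicate when it repeats an earlier row on every column are all assumptions. The anomaly bounds (`lower_tail`, `upper_tail`) are omitted because the record does not state how they are computed.

```r
# Toy event table; every column name other than `site` is hypothetical.
events <- data.frame(
  site      = c("A", "A", "A", "A", "A", "B", "B", "B"),
  person_id = c(1, 1, 1, 2, 3, 4, 4, 5),
  code      = c("x", "x", "y", "z", "w", "x", "x", "y"),
  date      = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03",
                "2020-01-04", "2020-02-01", "2020-02-01", "2020-02-02"),
  stringsAsFactors = FALSE
)

# Assumed duplicate definition: a row repeats an earlier row on all columns.
events$is_dup <- duplicated(events)

# Number of duplicate rows per patient (0 for patients without duplicates).
per_pt <- aggregate(is_dup ~ site + person_id, data = events, FUN = sum)
names(per_pt)[3] <- "dup_rows"

# Hypothetical helper: one summary row per site.
summarise_site <- function(s) {
  ev <- events[events$site == s, ]
  pt <- per_pt[per_pt$site == s, ]
  data.frame(
    site                  = s,
    total_rows            = nrow(ev),
    total_pt              = nrow(pt),
    duplicate_rows        = sum(ev$is_dup),
    duplicate_pt          = sum(pt$dup_rows > 0),
    duplicate_row_prop    = sum(ev$is_dup) / nrow(ev),
    duplicate_pt_prop     = sum(pt$dup_rows > 0) / nrow(pt),
    median_site_with0s    = median(pt$dup_rows),
    median_site_without0s = median(pt$dup_rows[pt$dup_rows > 0])
  )
}
result <- do.call(rbind, lapply(unique(events$site), summarise_site))

# Across-site summaries of the row proportion (the "user selection" here).
# Note: R's mad() scales by 1.4826 by default; whether the package does the
# same is not stated in this record.
vals <- result$duplicate_row_prop
result$mean_val   <- mean(vals)
result$median_val <- median(vals)
result$sd_val     <- sd(vals)
result$mad_val    <- mad(vals)
result$cov_val    <- sd(vals) / mean(vals)
result$max_val    <- max(vals)
result$min_val    <- min(vals)
result$range_val  <- max(vals) - min(vals)

print(result)
```

Computing the per-patient counts first keeps the "with 0s" and "without 0s" medians a matter of filtering one vector, which mirrors the distinction the two pairs of median columns draw.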
local.description.vizThis check outputs a dot plot of the proportion of patients or rows with duplicate values for each duplicate definition at each site. Dot size represents the mean value for the duplicate definition, dot color represents the proportion of duplicate values, and anomalous values are drawn as stars instead of dots. On hover, a tooltip shows metadata for the definition and the site, along with precise values for the proportion, mean proportion, median proportion, standard deviation, and MAD.
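The visual encoding described above can be approximated with a static base R plot. This is a rough sketch under stated assumptions, not the output of `dr_output`: the data values, the `"yes"`/`"no"` coding of `anomaly_yn`, the size scaling, and the color palette are all invented for illustration, and the hover tooltip is not reproduced.

```r
# Hypothetical per-site proportions for two duplicate definitions.
plot_df <- data.frame(
  site                 = rep(c("A", "B", "C"), times = 2),
  duplicate_definition = rep(c("def_1", "def_2"), each = 3),
  prop                 = c(0.02, 0.03, 0.20, 0.10, 0.12, 0.11),
  anomaly_yn           = c("no", "no", "yes", "no", "no", "no"),  # assumed coding
  stringsAsFactors = FALSE
)

# Mean proportion per definition across sites drives the point size.
plot_df$mean_val <- ave(plot_df$prop, plot_df$duplicate_definition, FUN = mean)

# Map the three encodings: star vs. dot, size, and color.
plot_df$pch <- ifelse(plot_df$anomaly_yn == "yes", 8, 16)  # 8 = star, 16 = dot
plot_df$cex <- 1 + 10 * plot_df$mean_val                   # arbitrary scaling
pal <- colorRampPalette(c("lightblue", "darkred"))(100)
plot_df$col <- pal[pmin(100, pmax(1, ceiling(plot_df$prop * 100)))]

# Sites on the x axis, duplicate definitions on the y axis.
site_f <- factor(plot_df$site)
def_f  <- factor(plot_df$duplicate_definition)
plot(as.integer(site_f), as.integer(def_f),
     pch = plot_df$pch, cex = plot_df$cex, col = plot_df$col,
     xaxt = "n", yaxt = "n", xlab = "site", ylab = "duplicate definition",
     xlim = c(0.5, nlevels(site_f) + 0.5),
     ylim = c(0.5, nlevels(def_f) + 0.5))
axis(1, at = seq_len(nlevels(site_f)), labels = levels(site_f))
axis(2, at = seq_len(nlevels(def_f)), labels = levels(def_f))
```

Storing the plotting symbol, size, and color as columns before drawing makes each encoding easy to inspect on its own.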
local.dqcheck.categoryConformance
local.dqcheck.clinicalprobeClinical Data Distributions
local.dqcheck.clinicalprobeExpected Clinical Event Representation
local.dqcheck.probeMissing Expected Data
local.dqcheck.probeAnomalous Values from Internal Distributions
local.dqcheck.requirementcohort
local.dqcheck.requirementdr_input_file
local.dqcheck.requirementomop_or_pcornet
local.dqcheck.requirementsingle_or_multi_site
local.dqcheck.requirementanomaly_or_exploratory
local.dqcheck.requirementtime
local.dqcheck.requirementpatient_level_tbl
local.dqcheck.requirementoutput_level
local.dqcheck.vizDot and Star Plot
relation.isCodeOfDQCheck929c8dfc-2c8b-4e62-8e1d-0fa06c542832
relation.isCodeOfDQCheck.latestForDiscovery929c8dfc-2c8b-4e62-8e1d-0fa06c542832

Files

Original bundle

Name: dr_ms_anom_cs.png
Size: 87.66 KB
Format: Portable Network Graphics