Applies garbage data using parameters from variables.csv extension columns. Garbage data is specified at the variable level.
Arguments
- values
Vector. Generated values (already has valid + missing).
- var_row
Data frame. Single row from variables.csv (contains garbage data parameters).
- variable_type
Character. "categorical", "continuous", "integer", "date".
- missing_codes
Numeric vector. Optional. Missing codes extracted from metadata (rows where recEnd contains "NA::"). Used to exclude missing codes from valid range calculations. If NULL, uses hardcoded fallback for backward compatibility.
- seed
Integer. Optional random seed.
Details
Garbage data fields (in variables.csv):
Garbage parameters:
garbage_low_prop: Proportion for low garbage values (0-1)
garbage_low_range: Range for low values (interval notation "min,max")
garbage_high_prop: Proportion for high garbage values (0-1)
garbage_high_range: Range for high values (interval notation "min,max")
Application order:
Apply garbage_low (if specified)
Apply garbage_high (if specified)
No overlap - indices removed after each application
Config-driven generation: If var_row is NULL or missing garbage fields, no garbage data applied.
Examples
if (FALSE) { # \dontrun{
var_row <- data.frame(
variable = "BMI",
garbage_low_prop = 0.02,
garbage_low_range = "[-10,0]",
garbage_high_prop = 0.01,
garbage_high_range = "[60,150]"
)
values <- c(23.5, 45.2, 30.1, 18.9, 25.6)
result <- apply_garbage(values, var_row, "continuous", seed = 123)
} # }