Skip to contents

Applies garbage data using parameters from variables.csv extension columns. Garbage data is specified at the variable level.

Usage

apply_garbage(
  values,
  var_row,
  variable_type,
  missing_codes = NULL,
  seed = NULL
)

Arguments

values

Vector. Generated values (already has valid + missing).

var_row

Data frame. Single row from variables.csv (contains garbage data parameters).

variable_type

Character. "categorical", "continuous", "integer", "date".

missing_codes

Numeric vector. Optional. Missing codes extracted from metadata (rows where recEnd contains "NA::"). Used to exclude missing codes from valid range calculations. If NULL, uses hardcoded fallback for backward compatibility.

seed

Integer. Optional random seed.

Value

Vector with garbage data applied.

Details

Garbage data fields (in variables.csv):

Garbage parameters:

  • garbage_low_prop: Proportion for low garbage values (0-1)

  • garbage_low_range: Range for low values (interval notation "min,max")

  • garbage_high_prop: Proportion for high garbage values (0-1)

  • garbage_high_range: Range for high values (interval notation "min,max")

Application order:

  1. Apply garbage_low (if specified)

  2. Apply garbage_high (if specified)

  3. No overlap - indices removed after each application

Config-driven generation: If var_row is NULL or missing garbage fields, no garbage data applied.

Examples

if (FALSE) { # \dontrun{
var_row <- data.frame(
  variable = "BMI",
  garbage_low_prop = 0.02,
  garbage_low_range = "[-10,0]",
  garbage_high_prop = 0.01,
  garbage_high_range = "[60,150]"
)
values <- c(23.5, 45.2, 30.1, 18.9, 25.6)
result <- apply_garbage(values, var_row, "continuous", seed = 123)
} # }