postprocess_mock_data() applies v0.4 mock_spec missing-code and
garbage-value rules to an already generated baseline data frame. It records a
mockdata_diagnostics attribute so downstream checks can distinguish values
assigned by post-processing from values that were drawn naturally by the
baseline generator.
Details
Missing-code diagnostics separate values that were naturally drawn as a
declared missing code (preexisting_missing_code_indices) from values that
were assigned by post-processing (assigned_missing_indices). Garbage rules
are applied only to rows that are not recorded in the missing-code
diagnostics. This preserves the audit trail in collision cases, such as a
valid category code that is also a declared missing code.
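The separation described above can be sketched in base R. This is an illustration only, not the package's implementation: the toy vector, the hand-picked assigned index, and the garbage_eligible name are assumptions made for the example, and only the two diagnostic field names come from this page.

```r
# Toy baseline column; the "9" at position 3 was drawn naturally (assumed data).
baseline <- c("1", "2", "9", "1", "3", "2")
missing_code <- "9"

# Values that already equal the declared missing code before post-processing.
preexisting_missing_code_indices <- which(baseline == missing_code)

# Hypothetical choice made by post-processing; a real run would sample this.
assigned_missing_indices <- 5L
processed <- baseline
processed[assigned_missing_indices] <- missing_code

# Rows still eligible for garbage rules: everything not recorded in either
# missing-code diagnostic (garbage_eligible is an illustrative name).
garbage_eligible <- setdiff(
  seq_along(processed),
  c(preexisting_missing_code_indices, assigned_missing_indices)
)
```

After post-processing, positions 3 and 5 both hold "9", so the values alone cannot tell them apart. The two index vectors are what keep the naturally drawn code distinct from the assigned one.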
Garbage rules are applied in canonical order: low, then high, then any
other named rules in caller order. Each garbage rule is a named list with a
proportion field and a range field using MockData range notation, for
example list(high = list(proportion = 0.05, range = "[150, 200]")).
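The canonical ordering can be illustrated with a short base R sketch. The helper order_garbage_rules() and the rule named typo are hypothetical and exist only for this example; the package may implement the ordering differently.

```r
# Hypothetical helper: put "low" and "high" first (when present), then keep
# any remaining rule names in the order the caller supplied them.
order_garbage_rules <- function(rules) {
  canonical <- intersect(c("low", "high"), names(rules))
  others <- setdiff(names(rules), c("low", "high"))
  rules[c(canonical, others)]
}

# Rules supplied out of canonical order; "typo" is an illustrative extra rule.
rules <- list(
  typo = list(proportion = 0.01, range = "[900, 999]"),
  high = list(proportion = 0.05, range = "[150, 200]"),
  low  = list(proportion = 0.02, range = "[-10, 0]")
)

ordered <- order_garbage_rules(rules)
names(ordered)  # order: low, high, typo
```

A fixed order matters when rules compete for the same rows, because an earlier rule claims its rows before a later one is applied.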
Diagnostics are stored as a data-frame attribute. Base R subsetting and some downstream tools may drop attributes, so preserve the original post-processed object when diagnostics are part of the audit trail.
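One defensive pattern is to capture the diagnostics in their own variable before transforming the data, then check for the attribute afterwards rather than assuming it survived. The sketch below uses a hand-built stand-in for a post-processed result; the data and the contents of the attribute are assumptions made for illustration.

```r
# Stand-in for a post-processed result (assumed data and attribute contents).
result <- data.frame(
  id = 1:5,
  smoking = c("never", "9", "former", "current", "never")
)
attr(result, "mockdata_diagnostics") <- list(
  variables = list(smoking = list(n = 5L, assigned_missing_indices = 2L))
)

# Capture the diagnostics before any transformation.
diagnostics <- attr(result, "mockdata_diagnostics")

# A downstream step that may or may not keep the attribute.
subset_result <- result["smoking"]

# Check rather than assume, and reattach only if the attribute was lost.
if (is.null(attr(subset_result, "mockdata_diagnostics"))) {
  attr(subset_result, "mockdata_diagnostics") <- diagnostics
}
```

Reattaching is only meaningful when row order and row count are unchanged, because the recorded indices refer to rows of the original object. After filtering or reordering rows, keep the original object for auditing.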
See also
generate_mock_data_native(), generate_mock_data_simstudy(),
mock_spec()
Other mock generation APIs:
create_mock_data(),
generate_mock_data_native(),
generate_mock_data_simstudy()
Examples
spec <- mock_categorical(
"smoking",
levels = c("never", "former", "current"),
proportions = c(0.5, 0.3, 0.2),
rtype = "character",
missing_codes = "9",
missing_proportions = 0.05
)
baseline <- generate_mock_data_native(spec, n = 20, seed = 1)
result <- postprocess_mock_data(baseline, spec, seed = 2)
attr(result, "mockdata_diagnostics")$variables$smoking
#> $n
#> [1] 20
#>
#> $preexisting_missing_code_indices
#> integer(0)
#>
#> $assigned_missing_indices
#> [1] 15
#>
#> $assigned_missing_codes
#> [1] "9"
#>
#> $assigned_garbage_indices
#> named list()
#>
#> $assigned_garbage_values
#> named list()
#>