variables <- data.frame(
variable = c("age", "smoking"),
variableType = c("Continuous", "Categorical"),
rType = c("integer", "character"),
role = c("enabled", "enabled"),
position = c(10, 20),
distribution = c("normal", NA),
mean = c(50, NA),
sd = c(12, NA),
stringsAsFactors = FALSE
)
variable_details <- data.frame(
variable = c("age", "age", "smoking", "smoking", "smoking"),
recStart = c("[18, 85]", "999", "1", "2", "7"),
recEnd = c("copy", "NA::b", "copy", "copy", "NA::b"),
proportion = c(0.95, 0.05, 0.60, 0.35, 0.05),
stringsAsFactors = FALSE
)

About this vignette: This how-to is for users moving existing create_mock_data() workflows from v0.3 to v0.4. It focuses on the compatibility wrapper, routing messages, diagnostics, and reproducibility differences.
What stayed the same
The main entry point is still create_mock_data(), and the existing arguments are still available.
mock_data <- create_mock_data(
databaseStart = "study",
variables = variables,
variable_details = variable_details,
n = 100,
seed = 123
)
head(mock_data)
  age smoking
1  43       1
2  47       2
3  69       1
4  51       1
5 999       1
6  71       7
For supported metadata, v0.4 routes this call through the new mock_spec pipeline.
See which path ran
Use verbose = TRUE when migrating. The message tells you whether the v0.4 path or the legacy path was used.
strict_data <- create_mock_data(
databaseStart = "study",
variables = variables,
variable_details = variable_details,
n = 50,
seed = 456,
verbose = TRUE
)
Generating via v0.4 mock_spec pipeline.
The v0.4 path returns a data frame with a diagnostics attribute.
Opt into legacy behavior
Set validate = FALSE when you need the legacy v0.3 dispatch path during migration. This is the explicit compatibility opt-out.
legacy_data <- create_mock_data(
databaseStart = "study",
variables = variables,
variable_details = variable_details,
n = 50,
seed = 456,
validate = FALSE,
verbose = TRUE
)
validate = FALSE requested; using legacy create_* dispatch.
Filtering for enabled variables...
Found 2 enabled variable(s) for database 'study': age, smoking
Setting random seed: 456
Generating 50 observations...
[1/2] Generating age (integer)
[2/2] Generating smoking (character)
Mock data generation complete!
Rows: 50
Variables: 2
Legacy output is a plain data frame without the v0.4 diagnostics attribute.
The strict and legacy paths should agree on the broad shape of supported data, but exact values can differ.
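Because only the v0.4 path attaches the diagnostics attribute, its presence is a quick way to tell the two kinds of output apart in a test. The sketch below uses base R only. came_from_v04() is a hypothetical helper name, not a package function, and the toy data frames stand in for real create_mock_data() output.

```r
# Hypothetical helper (not part of the package): TRUE when a data frame
# carries the "mockdata_diagnostics" attribute described above.
came_from_v04 <- function(x) {
  !is.null(attr(x, "mockdata_diagnostics", exact = TRUE))
}

# Toy stand-ins for real output, built by hand for illustration only.
legacy_like <- data.frame(age = c(43L, 47L), stringsAsFactors = FALSE)
strict_like <- legacy_like
attr(strict_like, "mockdata_diagnostics") <- list(spec_version = "0.4.0")

came_from_v04(legacy_like)
came_from_v04(strict_like)
```

A check like this could sit in a migration test to flag variables that unexpectedly fell back to legacy dispatch.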
Understand seed differences
In v0.3, the public seed controlled the legacy generators. In v0.4, the wrapper uses the public seed for baseline generation and seed + 1L for missing-code and garbage-value post-processing.
That makes both stages reproducible, but it means exact values may differ from v0.3 even when you pass the same seed.
strict_again <- create_mock_data(
databaseStart = "study",
variables = variables,
variable_details = variable_details,
n = 50,
seed = 456
)
identical(strict_data, strict_again)
[1] TRUE
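The two-stage seeding described above can be pictured with a small base R sketch. This is an illustration of the idea only, not the package's implementation: two_stage_demo() and its arguments are hypothetical, and the normal distribution and missing code are borrowed from the age metadata at the top of this vignette.

```r
# Illustrative two-stage generator; this is NOT the package's implementation.
two_stage_demo <- function(n, seed, missing_code = 999L, n_missing = 2L) {
  # Stage 1: baseline values use the public seed.
  set.seed(seed)
  values <- as.integer(round(rnorm(n, mean = 50, sd = 12)))

  # Stage 2: post-processing uses seed + 1L to pick rows for the missing code.
  set.seed(seed + 1L)
  idx <- sample.int(n, size = n_missing)
  values[idx] <- missing_code
  values
}

run_a <- two_stage_demo(n = 50, seed = 456L)
run_b <- two_stage_demo(n = 50, seed = 456L)
identical(run_a, run_b)  # same seed, same result
sum(run_a == 999L)       # number of rows given the missing code
```

Reseeding before the second stage means the rows chosen for missing codes do not depend on how many random draws the baseline stage consumed. It also means the combined stream differs from what a single-seed generator would produce.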
When testing migrations, compare structure, types, ranges, and proportions rather than expecting row-for-row equality with v0.3 output.
str(strict_data)
'data.frame': 50 obs. of 2 variables:
$ age : int 34 57 60 33 41 46 58 999 62 57 ...
$ smoking: chr "1" "1" "1" "1" ...
- attr(*, "mockdata_diagnostics")=List of 2
..$ spec_version: chr "0.4.0"
..$ variables :List of 2
.. ..$ age :List of 6
.. .. ..$ n : int 50
.. .. ..$ preexisting_missing_code_indices: int(0)
.. .. ..$ assigned_missing_indices : int [1:2] 8 13
.. .. ..$ assigned_missing_codes : chr [1:2] "999" "999"
.. .. ..$ assigned_garbage_indices : Named list()
.. .. ..$ assigned_garbage_values : Named list()
.. ..$ smoking:List of 6
.. .. ..$ n : int 50
.. .. ..$ preexisting_missing_code_indices: int(0)
.. .. ..$ assigned_missing_indices : int [1:2] 12 35
.. .. ..$ assigned_missing_codes : chr [1:2] "7" "7"
.. .. ..$ assigned_garbage_indices : Named list()
.. .. ..$ assigned_garbage_values : Named list()
prop.table(table(strict_data$smoking))
   1    2    7
0.46 0.50 0.04
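One way to turn that advice into a repeatable check is a small summary helper. The sketch below is base R only. summarise_shape() is a hypothetical name, and the two toy data frames are made-up stand-ins for v0.3 and v0.4 output.

```r
# Hypothetical helper: summarise the "shape" of a mock data frame so two
# outputs can be compared without expecting identical values.
summarise_shape <- function(df) {
  lapply(df, function(col) {
    if (is.numeric(col)) {
      list(class = class(col)[1], range = range(col, na.rm = TRUE))
    } else {
      list(class = class(col)[1],
           proportions = round(prop.table(table(col)), 2))
    }
  })
}

# Toy stand-ins for v0.3 and v0.4 output; the values are made up.
old_like <- data.frame(age = c(30L, 55L, 999L, 61L),
                       smoking = c("1", "2", "1", "7"),
                       stringsAsFactors = FALSE)
new_like <- data.frame(age = c(41L, 999L, 48L, 72L),
                       smoking = c("2", "1", "1", "7"),
                       stringsAsFactors = FALSE)

old_shape <- summarise_shape(old_like)
new_shape <- summarise_shape(new_like)

# Structural checks that should hold even when exact values differ.
identical(names(old_like), names(new_like))
identical(vapply(old_shape, `[[`, character(1), "class"),
          vapply(new_shape, `[[`, character(1), "class"))
```

In practice you would probably drop missing codes such as 999 before comparing ranges, and compare proportions within a tolerance rather than exactly.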
Know the fallback conditions
create_mock_data() deliberately uses the legacy path when:
- validate = FALSE
- variable_details = NULL
- detail-level databaseStart filtering is needed but variables has no databaseStart column
- the requested metadata uses a feature not yet supported by the v0.4 native backend
For example, variable_details = NULL keeps the simple legacy fallback.
fallback_data <- create_mock_data(
databaseStart = "study",
variables = variables[1, ],
variable_details = NULL,
n = 20,
seed = 789,
verbose = TRUE
)
No details file provided - using simple fallback generation
variable_details = NULL; using legacy create_* fallback dispatch.
Filtering for enabled variables...
Found 1 enabled variable(s) for database 'study': age
Setting random seed: 789
Generating 20 observations...
[1/1] Generating age (integer)
Warning: No variable_details rows found for variable 'age' and databaseStart
'study'. Using fallback uniform range [0, 100].
Mock data generation complete!
Rows: 20
Variables: 1
head(fallback_data)
  age
1  70
2   9
3   1
4  59
5  49
6   2
Unsupported v0.4 backend features also route to legacy dispatch. This example uses an exponential continuous distribution, which remains available through the legacy generator.
exp_variables <- data.frame(
variable = "time_to_visit",
variableType = "Continuous",
rType = "double",
role = "enabled",
distribution = "exponential",
rate = 0.5,
stringsAsFactors = FALSE
)
exp_details <- data.frame(
variable = "time_to_visit",
recStart = "[0, 10]",
recEnd = "copy",
proportion = 1,
stringsAsFactors = FALSE
)
exp_data <- create_mock_data(
databaseStart = "study",
variables = exp_variables,
variable_details = exp_details,
n = 20,
seed = 321,
verbose = TRUE
)
v0.4 mock_spec pipeline does not yet support every requested variable; using legacy create_* dispatch. Unsupported variable(s): time_to_visit
Filtering for enabled variables...
Found 1 enabled variable(s) for database 'study': time_to_visit
Setting random seed: 321
Generating 20 observations...
[1/1] Generating time_to_visit (double)
Mock data generation complete!
Rows: 20
Variables: 1
head(exp_data)
  time_to_visit
1     0.3302437
2     1.4268834
3     2.5103901
4     2.1157332
5     1.7882266
6     1.2263829
Inspect the v0.4 path directly
When debugging a migration, run the wrapper's v0.4 steps one at a time: build the spec, validate it, generate the baseline, then post-process.
spec <- mock_spec_from_recodeflow(variables, variable_details)
validate_mock_spec(spec, strict = TRUE)
MockData mock_spec validation result: valid
baseline <- generate_mock_data_native(spec, n = 50, seed = 456)
postprocessed <- postprocess_mock_data(baseline, spec, seed = 457)
identical(strict_data, postprocessed)
[1] TRUE
This makes it easier to tell whether an issue is coming from metadata parsing, baseline generation, or post-processing.
What to check in sibling packages
For cchsflow, chmsflow, and recodeflow workflows, test representative variables.csv and variable_details.csv files with:
mock <- create_mock_data(
databaseStart = "your-cycle",
variables = "variables.csv",
variable_details = "variable_details.csv",
n = 100,
seed = 123,
validate = TRUE,
verbose = TRUE
)
str(mock)
attr(mock, "mockdata_diagnostics")

Report cases where metadata unexpectedly falls back to legacy dispatch, where a variable generated in v0.3 but errors in v0.4, or where the generated values, types, or diagnostics are surprising.
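To spot surprising diagnostics more easily, it can help to flatten the nested list into one row per variable. The sketch below is base R only and assumes the diagnostics have the structure shown in the str() output earlier in this vignette. summarise_diagnostics() is a hypothetical helper, and diag_like is a hand-built stand-in for attr(mock, "mockdata_diagnostics").

```r
# Hand-built list mirroring the str() output shown earlier; with real data
# this would come from attr(mock, "mockdata_diagnostics").
diag_like <- list(
  spec_version = "0.4.0",
  variables = list(
    age = list(n = 50L,
               assigned_missing_indices = c(8L, 13L),
               assigned_missing_codes = c("999", "999")),
    smoking = list(n = 50L,
                   assigned_missing_indices = c(12L, 35L),
                   assigned_missing_codes = c("7", "7"))
  )
)

# Hypothetical helper: one row per variable with the number of rows that
# were assigned a missing code during post-processing.
summarise_diagnostics <- function(diag) {
  vars <- diag$variables
  data.frame(
    variable = names(vars),
    n = unname(vapply(vars, function(v) v$n, integer(1))),
    n_missing = unname(vapply(
      vars, function(v) length(v$assigned_missing_indices), integer(1)
    )),
    stringsAsFactors = FALSE
  )
}

diag_summary <- summarise_diagnostics(diag_like)
diag_summary$prop_missing <- diag_summary$n_missing / diag_summary$n
diag_summary
```

Comparing prop_missing against the proportions declared in variable_details is one quick sanity check to include in a report.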