About this vignette: This tutorial demonstrates the metadata-driven approach to batch mock data generation. All examples use executable code with real metadata files.
Why configuration files?
MockData takes a metadata-driven approach that differs from other synthetic data packages:
Most synthetic data packages: Specify distributions directly in code
# Typical approach in other packages
synthetic_data <- generate_data(
  age = normal(mean = 55, sd = 15),
  smoking = categorical(probs = c(0.3, 0.5, 0.2))
)
MockData: Define data structure in metadata files, reuse for both harmonization and mock data generation
# MockData approach
mock_data <- create_mock_data(
  config_path = "variables.csv",
  details_path = "variable_details.csv",
  n = 1000
)
Benefits of the metadata-driven approach:
- Reuse existing recodeflow metadata - No need to duplicate variable definitions
- Single source of truth - Variable structure defined once, used everywhere
- Consistency with harmonization - Mock data matches real data structure exactly
- Easy to maintain - Update metadata once, affects both harmonization and mock data
- Version control friendly - Metadata files work well with git
When to use configuration files:
- Many variables to generate (5+)
- Existing recodeflow metadata from harmonization projects
- Need to match published descriptive statistics (“Table 1”)
- Reproducible workflows with version-controlled metadata
Alternative approach: For learning or working with just a few variables, see Getting started for the variable-by-variable approach.
Setup
Quick reference: Configuration columns
This tutorial will show you how to use these columns step-by-step. Here’s a quick reference for later:
Configuration file (variables.csv):
| Column | Required | Description | Examples |
|---|---|---|---|
| uid | Yes | Unique identifier for variable definition | “smoking_v1”, “age_v1” |
| variable | Yes | Variable name (output column name) | “smoking”, “age”, “birth_date” |
| role | Yes | “enabled” to generate; for dates use “baseline-date”, “index-date” | “enabled”, “baseline-date” |
| variableType | Yes | “categorical” or “continuous” (dates use “continuous”) | “categorical”, “continuous” |
| variableLabel | No | Human-readable description | “Smoking status”, “Age in years” |
| position | Yes | Generation order (1 = first, 2 = second, etc.) | 1, 2, 3 |
Details file (variable_details.csv):
| Column | Required | Description | Examples |
|---|---|---|---|
| uid | Yes | Must match config uid | “smoking_v1” |
| uid_detail | Yes | Unique identifier for this row | “smoking_v1_d1” |
| variable | Yes | Must match config variable | “smoking” |
| recStart | Yes | Input value or range | “1”, “[18,100]”, “[1950-01-01, 2000-12-31]” |
| recEnd | Yes | Output transformation | “copy”, “1”, “999”, “corrupt_high” |
| catLabel | No | Short category label | “Daily smoker”, “Missing” |
| catLabelLong | No | Long category description | “Smokes daily” |
| proportion | No | Proportion for this category (must sum to 1.0) | 0.28, 0.05 |
| rType | No | R data type for output | “integer”, “factor”, “Date”, “double” |
Common recEnd values:
- "copy" - Pass through value from recStart (use for ranges like “[18,100]”)
- Same as recStart - Output specific value (e.g., “1”, “2”, “999”)
- "corrupt_high" / "corrupt_low" - Generate garbage data in specified range
- "mean" / "sd" - Specify normal distribution parameters (advanced)
Your first configuration file
Let’s start with the simplest possible example: a single categorical variable. This shows the basic pattern you’ll use for all variables.
Step 1: Create config for one variable
The configuration file lists which variables to generate:
# Start with just smoking status
config <- data.frame(
  uid = "smoking_v1",
  variable = "smoking",
  role = "enabled",
  variableType = "categorical",
  variableLabel = "Smoking status",
  position = 1,
  stringsAsFactors = FALSE
)
Configuration table (1 row):
| uid | variable | role | variableType | variableLabel | position |
|---|---|---|---|---|---|
| smoking_v1 | smoking | enabled | categorical | Smoking status | 1 |
Key columns:
- uid: Unique identifier for this variable definition
- variable: Variable name (becomes column name in output)
- role: Set to “enabled” to generate this variable (for dates, use a role like “baseline-date”)
- variableType: “categorical” or “continuous” (dates use “continuous” + role for recodeflow compatibility)
- position: Order in which variables are generated (1 = first, 2 = second, etc.)
Note on stringsAsFactors = FALSE: This argument prevents data.frame() from converting character columns to factors. Since R 4.0.0, FALSE is already the default, but setting it explicitly keeps the examples behaving the same on older R versions and preserves the exact data types you specify.
Step 2: Define the categories
The details file specifies what values this variable can have:
# Define the three categories
details <- data.frame(
  uid = c("smoking_v1", "smoking_v1", "smoking_v1"),
  uid_detail = c("smoking_v1_d1", "smoking_v1_d2", "smoking_v1_d3"),
  variable = c("smoking", "smoking", "smoking"),
  recStart = c("1", "2", "3"),
  recEnd = c("1", "2", "3"),
  catLabel = c("Daily smoker", "Occasional smoker", "Never smoked"),
  catLabelLong = c("Smokes daily", "Smokes occasionally", "Never smoked"),
  stringsAsFactors = FALSE
)
Details table (3 rows):
| uid | uid_detail | variable | recStart | recEnd | catLabel | catLabelLong |
|---|---|---|---|---|---|---|
| smoking_v1 | smoking_v1_d1 | smoking | 1 | 1 | Daily smoker | Smokes daily |
| smoking_v1 | smoking_v1_d2 | smoking | 2 | 2 | Occasional smoker | Smokes occasionally |
| smoking_v1 | smoking_v1_d3 | smoking | 3 | 3 | Never smoked | Never smoked |
Key columns:
- uid: Must match the uid in config file
- uid_detail: Unique identifier for this detail row
- variable: Must match the variable name in config
- recStart: The input value or range
  - For categorical: Category codes like “1”, “2”, “3”
  - For continuous: Ranges like “[18, 100]” or missing codes like “999”
  - For dates: Date ranges like “[1950-01-01, 2000-12-31]”
- recEnd: The output transformation
  - “copy” = pass through the value from recStart
  - Same as recStart = output the specific value (e.g., “1”, “999”)
  - “corrupt_high” / “corrupt_low” = generate garbage data
- catLabel: Short label for this category
- proportion: Optional proportion for this category (must sum to 1.0 within each variable)
- rType: Optional R data type for output (e.g., “integer”, “factor”, “Date”)
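The matching rules above (every uid in the details file must exist in the config file, and vice versa) can be checked by hand before generating data. The following is a minimal base R sketch; `check_metadata_links()` is a hypothetical helper written for this illustration and is not a MockData function, and the two small data frames are stand-ins for real metadata.

```r
# Hypothetical helper (not part of MockData): lists uids that appear in
# one table but not in the other.
check_metadata_links <- function(config, details) {
  list(
    config_without_details = setdiff(config$uid, details$uid),
    details_without_config = setdiff(details$uid, config$uid)
  )
}

# Stand-in config with two variables
config_example <- data.frame(
  uid = c("smoking_v1", "age_v1"),
  variable = c("smoking", "age"),
  stringsAsFactors = FALSE
)

# Stand-in details that only describe smoking so far
details_example <- data.frame(
  uid = c("smoking_v1", "smoking_v1", "smoking_v1"),
  variable = c("smoking", "smoking", "smoking"),
  stringsAsFactors = FALSE
)

links <- check_metadata_links(config_example, details_example)
links$config_without_details  # "age_v1" has no detail rows yet
```

A non-empty result in either element points to a uid that needs to be added or corrected before the two files are used together.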
Step 3: Generate your first mock dataset
# create_mock_data() reads CSV files, so we need to save our data.frames first
temp_config <- tempfile(fileext = ".csv")
temp_details <- tempfile(fileext = ".csv")
write.csv(config, temp_config, row.names = FALSE)
write.csv(details, temp_details, row.names = FALSE)
# Generate 100 observations
mock_data <- create_mock_data(
config_path = temp_config, # Path to configuration CSV
details_path = temp_details, # Path to variable details CSV
n = 100, # Number of observations to generate
seed = 123, # Random seed for reproducibility
verbose = FALSE # Suppress progress messages
)
# Clean up temp files
unlink(c(temp_config, temp_details))
# View the data
head(mock_data, 10)
   smoking
1        2
2        1
3        3
4        1
5        1
6        2
7        3
8        1
9        3
10       3
table(mock_data$smoking)
 1  2  3
33 33 34
create_mock_data() parameters:
- config_path: Path to CSV file listing which variables to generate
- details_path: Path to CSV file specifying variable structure (categories, ranges, proportions)
- n: Number of observations (rows) to generate
- seed: Random seed for reproducibility; the same seed always produces identical data
- verbose: Set to FALSE to suppress informational messages during generation (useful for cleaner output in reports)
- validate: Optional, defaults to TRUE; validates metadata before generation
What happened:
- MockData read both files to understand the variable structure
- Generated 100 random values from {1, 2, 3} with uniform distribution
- Returned a data frame with one column: smoking
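To make "uniform distribution" concrete: with no proportion column, each category code is equally likely. The sketch below illustrates that idea with base R's `sample()`. It is an illustration only and does not reproduce MockData's internal code or its exact random draws.

```r
# Illustration only: equal-probability sampling with base R,
# not MockData's internal implementation.
set.seed(123)
codes <- c("1", "2", "3")

# Each code has the same chance of being drawn
draws <- sample(codes, size = 100, replace = TRUE)

table(draws)
```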
Controlling proportions: The “Table 1” use case
Research papers typically include a “Table 1” with descriptive statistics. MockData lets you generate data that matches these published statistics.
Example scenario: A paper reports smoking prevalence in their cohort:
- Daily smokers: 28%
- Occasional smokers: 18%
- Never smokers: 54%
Let’s generate mock data that matches these proportions:
# Add proportion column to match published statistics
details_with_props <- data.frame(
uid = c("smoking_v1", "smoking_v1", "smoking_v1"),
uid_detail = c("smoking_v1_d1", "smoking_v1_d2", "smoking_v1_d3"),
variable = c("smoking", "smoking", "smoking"),
recStart = c("1", "2", "3"),
recEnd = c("1", "2", "3"),
catLabel = c("Daily smoker", "Occasional smoker", "Never smoked"),
catLabelLong = c("Smokes daily", "Smokes occasionally", "Never smoked"),
proportion = c(0.28, 0.18, 0.54), # Match published prevalence
stringsAsFactors = FALSE
)
# Save to temporary files
temp_config <- tempfile(fileext = ".csv")
temp_details <- tempfile(fileext = ".csv")
write.csv(config, temp_config, row.names = FALSE)
write.csv(details_with_props, temp_details, row.names = FALSE)
# Generate data matching these proportions
mock_data_table1 <- create_mock_data(
config_path = temp_config,
details_path = temp_details,
n = 1000, # Larger sample for better proportion match
seed = 123,
verbose = FALSE
)
# Clean up
unlink(c(temp_config, temp_details))
# Verify proportions
prop.table(table(mock_data_table1$smoking))
1 2 3
0.286 0.169 0.545
Key insight: Proportions must sum to 1.0 for each variable. MockData samples according to these proportions, making it easy to match published descriptive statistics.
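One way to confirm that proportions sum to 1.0 within each variable is to total them per variable before saving the details file. This is a base R sketch on a hypothetical details table (`details_check` is made up for the example); it does not call any MockData function.

```r
# Hypothetical details table with a proportion column
details_check <- data.frame(
  variable = c("smoking", "smoking", "smoking", "age", "age"),
  proportion = c(0.28, 0.18, 0.54, 0.95, 0.05),
  stringsAsFactors = FALSE
)

# Sum the proportions within each variable
prop_sums <- tapply(details_check$proportion, details_check$variable, sum)

# Compare with a small tolerance, since decimal fractions are not
# represented exactly in floating point
sums_ok <- abs(prop_sums - 1) < 1e-8
sums_ok
```

Any variable flagged FALSE has proportions that need to be adjusted.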
Adding a second variable
Now let’s expand our config to include age alongside smoking:
Step 1: Update config file
Multi-variable configuration (2 rows):
| uid | variable | role | variableType | variableLabel | position |
|---|---|---|---|---|---|
| smoking_v1 | smoking | enabled | categorical | Smoking status | 1 |
| age_v1 | age | enabled | continuous | Age in years | 2 |
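The later steps refer to a data frame named config_multi, but only its rendered table appears here. A data frame consistent with that table could be built as follows; this is a reconstruction from the table, so the exact original code is an assumption.

```r
# Sketch of config_multi, reconstructed from the table above
config_multi <- data.frame(
  uid = c("smoking_v1", "age_v1"),
  variable = c("smoking", "age"),
  role = c("enabled", "enabled"),
  variableType = c("categorical", "continuous"),
  variableLabel = c("Smoking status", "Age in years"),
  position = c(1, 2),
  stringsAsFactors = FALSE
)

config_multi
```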
Step 2: Add age to details
# Add age details (range + missing code)
# Build smoking details first
smoking_details <- data.frame(
uid = rep("smoking_v1", 3),
uid_detail = c("smoking_v1_d1", "smoking_v1_d2", "smoking_v1_d3"),
variable = rep("smoking", 3),
recStart = c("1", "2", "3"),
recEnd = c("1", "2", "3"),
catLabel = c("Daily smoker", "Occasional smoker", "Never smoked"),
catLabelLong = c("Smokes daily", "Smokes occasionally", "Never smoked"),
proportion = c(0.28, 0.18, 0.54),
rType = rep("factor", 3),
stringsAsFactors = FALSE
)
# Build age details
age_details <- data.frame(
uid = rep("age_v1", 2),
uid_detail = c("age_v1_d1", "age_v1_d2"),
variable = rep("age", 2),
recStart = c("[18, 100]", "999"),
recEnd = c("copy", "999"),
catLabel = c("Age in years", "Missing"),
catLabelLong = c("Age in years", "Not stated"),
proportion = c(0.95, 0.05),
rType = rep("integer", 2),
stringsAsFactors = FALSE
)
# Combine
details_multi <- rbind(smoking_details, age_details)
Combined details table (5 rows):
| uid | uid_detail | variable | recStart | recEnd | catLabel | catLabelLong | proportion | rType |
|---|---|---|---|---|---|---|---|---|
| smoking_v1 | smoking_v1_d1 | smoking | 1 | 1 | Daily smoker | Smokes daily | 0.28 | factor |
| smoking_v1 | smoking_v1_d2 | smoking | 2 | 2 | Occasional smoker | Smokes occasionally | 0.18 | factor |
| smoking_v1 | smoking_v1_d3 | smoking | 3 | 3 | Never smoked | Never smoked | 0.54 | factor |
| age_v1 | age_v1_d1 | age | [18, 100] | copy | Age in years | Age in years | 0.95 | integer |
| age_v1 | age_v1_d2 | age | 999 | 999 | Missing | Not stated | 0.05 | integer |
Note: Added rType column to specify output types (factor for smoking, integer for age).
Step 3: Generate multi-variable dataset
# Save to temporary files
temp_config <- tempfile(fileext = ".csv")
temp_details <- tempfile(fileext = ".csv")
write.csv(config_multi, temp_config, row.names = FALSE)
write.csv(details_multi, temp_details, row.names = FALSE)
# Generate 1000 observations
mock_data_multi <- create_mock_data(
config_path = temp_config,
details_path = temp_details,
n = 1000,
seed = 123,
verbose = FALSE
)
# Clean up
unlink(c(temp_config, temp_details))
# View first 10 rows
head(mock_data_multi, 10)
   smoking age
1        3  40
2        1  67
3        3  31
4        2  88
5        2  88
6        3  57
7        3  81
8        2  42
9        1  23
10       3  54
What happened:
- MockData generated 2 variables in one call
- smoking: Factor with 3 levels, distributed according to proportions
- age: Integer values between 18-100, with 5% missing (code 999)
- Both variables respect the specified rType
Step 4: Verify the results
Smoking distribution:
| Category | Proportion |
|---|---|
| 1 | 0.286 |
| 2 | 0.169 |
| 3 | 0.545 |
Age summary (the 999 missing code is counted as a number here, which inflates the mean and the maximum):
| Statistic | Value |
|---|---|
| Min. | 18.00 |
| 1st Qu. | 40.00 |
| Median | 62.00 |
| Mean | 106.08 |
| 3rd Qu. | 83.00 |
| Max. | 999.00 |
Data types:
- Smoking: factor
- Age: integer
Missing values:
- Smoking: 0 / 1000
- Age: 50 / 1000
Expected results:
- Smoking: ~28% daily, ~18% occasional, ~54% never (close to specified proportions)
- Age: Integer values between 18-100, approximately 5% coded as 999
- Smoking is a factor, age is integer
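The age summary reports a mean above 100 because the 999 missing code is treated as an ordinary number. Recoding it to NA before summarizing gives statistics for the valid ages only. The sketch below uses a small hand-made vector (`age_raw`) standing in for the generated age column, so the numbers are not the tutorial's results.

```r
# Small hand-made vector standing in for the generated age column
age_raw <- c(40L, 67L, 31L, 999L, 88L, 57L)

# Replace the missing code with NA before computing statistics
age_clean <- ifelse(age_raw == 999L, NA_integer_, age_raw)

mean(age_raw)                  # pulled upward by the 999 code
mean(age_clean, na.rm = TRUE)  # mean of the valid ages only
sum(is.na(age_clean))          # how many values carried the missing code
```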
Working with date variables
Date variables follow a special pattern for compatibility with recodeflow metadata:
- variableType = "continuous" (dates are stored as numbers in recodeflow)
- role contains “date” (e.g., “index-date”, “baseline-date”)
- rType = "Date" in details (specifies R Date class)
Let’s add a birth date variable to our dataset:
# Add birth_date to config
config_with_date <- rbind(
config_multi,
data.frame(
uid = "birth_date_v1",
variable = "birth_date",
role = "baseline-date", # Role identifies this as a date
variableType = "continuous", # Dates are continuous in recodeflow
variableLabel = "Date of birth",
position = 3,
stringsAsFactors = FALSE
)
)
# Build date details
date_details <- data.frame(
uid = rep("birth_date_v1", 1),
uid_detail = c("birth_date_v1_d1"),
variable = rep("birth_date", 1),
recStart = c("[1950-01-01, 2000-12-31]"), # Date range
recEnd = c("copy"),
catLabel = c("Birth date"),
catLabelLong = c("Date of birth"),
proportion = c(1.0),
rType = rep("Date", 1), # Output as R Date class
stringsAsFactors = FALSE
)
# Combine all details
details_with_date <- rbind(details_multi, date_details)
# Save and generate
temp_config <- tempfile(fileext = ".csv")
temp_details <- tempfile(fileext = ".csv")
write.csv(config_with_date, temp_config, row.names = FALSE)
write.csv(details_with_date, temp_details, row.names = FALSE)
mock_data_with_date <- create_mock_data(
config_path = temp_config,
details_path = temp_details,
n = 100,
seed = 123,
verbose = FALSE
)
unlink(c(temp_config, temp_details))
# View results
head(mock_data_with_date, 10)
   smoking age
1        3  67
2        1  45
3        3  58
4        2  96
5        2  58
6        3  91
7        3  93
8        2  68
9        1  52
10       3  30
What happened:
- MockData detected the date variable by checking role for “date”
- Generated random dates between 1950-01-01 and 2000-12-31
- Applied rType = "Date" to return R Date objects (not numbers)
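Verification:
- Columns in the output: smoking, age
- Note: the birth_date column does not appear in the output shown above, so the date generation described under “What happened” is not confirmed by this run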
Key insight: This pattern maintains compatibility with recodeflow (where dates are variableType = "continuous") while allowing MockData to generate proper Date objects using rType.
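To illustrate what a date range such as “[1950-01-01, 2000-12-31]” describes, the sketch below parses the range string and draws random dates inside it using base R. `parse_date_range()` is a hypothetical helper for this example; how MockData parses recStart internally is not shown in this tutorial.

```r
# Illustration only: parse_date_range() is a hypothetical helper,
# not a MockData function.
parse_date_range <- function(range_string) {
  inner <- gsub("\\[|\\]", "", range_string)   # drop the square brackets
  parts <- trimws(strsplit(inner, ",")[[1]])    # split on the comma
  as.Date(parts)                                # ISO dates parse directly
}

bounds <- parse_date_range("[1950-01-01, 2000-12-31]")

set.seed(123)
# Sample whole-day offsets uniformly between the two bounds
offsets <- sample(0:as.integer(diff(bounds)), size = 5, replace = TRUE)
dates <- bounds[1] + offsets

class(dates)  # "Date"
```

Adding integer day offsets to a Date keeps the Date class, which is the same kind of object that rType = "Date" is meant to produce.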
Adding data quality issues (garbage data)
Real datasets have garbage values (data entry errors, out-of-range values). MockData can simulate these for testing data validation pipelines.
# Add garbage rows to details
details_with_garbage <- rbind(
details_multi,
data.frame(
uid = "age_v1",
uid_detail = "age_v1_d3",
variable = "age",
recStart = "[200, 300]",
recEnd = "corrupt_high",
catLabel = "Data entry error",
catLabelLong = "Impossible age value",
proportion = 0.02,
rType = "integer",
stringsAsFactors = FALSE
)
)
# Note: proportions will be automatically normalized to sum to 1.0
# Save to temporary files
temp_config <- tempfile(fileext = ".csv")
temp_details <- tempfile(fileext = ".csv")
write.csv(config_multi, temp_config, row.names = FALSE)
write.csv(details_with_garbage, temp_details, row.names = FALSE)
# Regenerate with garbage
mock_data_dirty <- create_mock_data(
config_path = temp_config,
details_path = temp_details,
n = 1000,
seed = 123,
verbose = FALSE
)
# Clean up
unlink(c(temp_config, temp_details))
# Find garbage values
garbage_count <- sum(mock_data_dirty$age > 100 & mock_data_dirty$age < 999, na.rm = TRUE)
example_garbage <- head(mock_data_dirty$age[mock_data_dirty$age > 100 & mock_data_dirty$age < 999], 5)
Garbage data summary:
- Garbage values found: 19 / 1000
- Example garbage ages: 259, 257, 261, 244, 296
What happened:
- Added corrupt_high specification with range [200, 300]
- MockData adjusted proportions so all rows sum to 1.0
- Generated ~2% garbage values for testing data cleaning pipelines
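The three age rows now carry proportions of 0.95, 0.05, and 0.02, which total 1.02. The arithmetic below assumes that normalization means dividing each proportion by the total; the tutorial does not state the exact method MockData uses, so treat this as a sketch of one plausible rescaling.

```r
# Proportions for the three age rows after adding the garbage row
raw_props <- c(valid = 0.95, missing = 0.05, garbage = 0.02)

sum(raw_props)  # the rows no longer sum to 1

# Assumed rescaling: divide each proportion by the total
normalized <- raw_props / sum(raw_props)
round(normalized, 4)
```

Under this assumption the garbage share is 0.02 / 1.02, a little under 2%, which is in line with the 19 / 1000 garbage values reported above.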
Common garbage types:
- corrupt_low: Values below valid range
- corrupt_high: Values above valid range
- corrupt_future: Dates in the future (for date variables)
- corrupt_past: Dates too far in the past (for date variables)
See Garbage data documentation for complete specifications.
Working with existing recodeflow metadata
For real projects, you’ll reuse existing harmonization metadata. MockData works with the same files used by recodeflow:
# Load DemPoRT example configuration (from recodeflow)
config_file <- system.file(
"extdata/demport/variables_DemPoRT.csv",
package = "MockData"
)
details_file <- system.file(
"extdata/demport/variable_details_DemPoRT.csv",
package = "MockData"
)
# Read configuration
demport_config <- read.csv(config_file, stringsAsFactors = FALSE, check.names = FALSE)
demport_details <- read.csv(details_file, stringsAsFactors = FALSE, check.names = FALSE)
# See what variables are available
first_10_vars <- head(demport_config$variable, 10)
n_vars <- nrow(demport_config)
n_details <- nrow(demport_details)
DemPoRT variables (first 10):
ADL_01, ADL_02, ADL_03, ADL_04, ADL_05, ADL_06, ADL_07, ADL_der, ADL_score_5, ADL_score_6
- Total variables: 74
- Details rows: 669
Key insight: These are the same metadata files used for harmonization. By reusing them for mock data generation, you ensure consistency between mock and real data structures.
Typical workflow:
- Define variables and harmonization rules in recodeflow
- Use the same metadata to generate mock data for testing
- Develop analysis pipelines with mock data
- Apply to real data once pipelines are validated
Saving and loading configurations
For reproducible workflows, save your configurations as CSV files. This makes them version-controllable and shareable:
# Save configuration to CSV
write.csv(config_multi, "my_mock_data_config.csv", row.names = FALSE)
write.csv(details_multi, "my_mock_data_config_details.csv", row.names = FALSE)
# Later, load and regenerate identical data
# Same seed produces identical mock data
mock_data <- create_mock_data(
config_path = "my_mock_data_config.csv",
details_path = "my_mock_data_config_details.csv",
n = 1000,
seed = 123
)
Benefits of saving to CSV:
- Version control with git
- Share configurations with collaborators
- Audit trail for what mock data was generated
- Easy to update and maintain
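When you reload a saved metadata CSV yourself for inspection or editing, note that base R guesses column types on read, so category codes stored as text can come back as integers. The sketch below shows this behavior on a hypothetical table (`details_small`) and how `colClasses = "character"` keeps the codes as written. Whether create_mock_data() is affected by type guessing is not stated in this tutorial.

```r
# Hypothetical two-row details table with codes stored as text
details_small <- data.frame(
  uid = c("smoking_v1", "smoking_v1"),
  recStart = c("1", "2"),
  recEnd = c("1", "2"),
  stringsAsFactors = FALSE
)

tmp <- tempfile(fileext = ".csv")
write.csv(details_small, tmp, row.names = FALSE)

# Default type guessing converts the code columns to integers
guessed <- read.csv(tmp, stringsAsFactors = FALSE)
class(guessed$recStart)  # "integer"

# Reading every column as character keeps the codes as written
as_text <- read.csv(tmp, colClasses = "character")
class(as_text$recStart)  # "character"

unlink(tmp)
```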
What you learned
In this tutorial, you learned:
- Why metadata-driven: MockData’s unique approach reuses harmonization metadata
- Seeding configs: Start simple (one variable) then build complexity
- Table 1 matching: Use proportions to match published descriptive statistics
- Multi-variable generation: Batch generation with create_mock_data()
- Data quality simulation: Add missing codes and garbage values
- Recodeflow integration: Reuse existing harmonization metadata
- Reproducibility: Save configurations and use seeds for identical results
Next steps
Core concepts:
- Getting started - Variable-by-variable approach for learning
- Missing data - Detailed missing data patterns
- Date variables - Working with dates and survival times
Real-world examples:
- CCHS example - Canadian Community Health Survey workflow
- CHMS example - Canadian Health Measures Survey workflow
- DemPoRT example - Survival analysis with competing risks
Advanced:
- Configuration reference - Complete configuration specification
- Advanced topics - Performance and integration