# During create_mock_data(), derived variables are excluded
enabled_vars <- variables[grepl("enabled", variables$role), ]
# Identify derived variables
derived_vars <- identify_derived_vars(enabled_vars, variable_details)
# Exclude from generation
enabled_vars <- enabled_vars[!enabled_vars$variable %in% derived_vars, ]
# Result: Only raw variables (height, weight) are generated
# BMI_derived must be calculated separately after generation
About this vignette: This guide covers advanced technical topics for MockData power users. Read the Getting started guide before this document.
Overview
This guide covers advanced technical topics for MockData power users, including:
- Derived variables and custom functions
- Unique identifiers (UIDs) for traceability
- Multi-database workflows
- Integration with harmonization pipelines
- Performance optimization
Prerequisites: Read the Getting started guide before this document.
Derived variables
Derived variables are calculated from other variables using custom functions. MockData identifies and skips derived variables during generation, leaving them for post-generation calculation.
Identifying derived variables
Derived variables use special patterns in variable_details.csv:
- recStart contains DerivedVar::[var1, var2, ...]
- recEnd contains Func::function_name
Example:
uid,uid_detail,variable,recStart,recEnd,catLabel
cchsflow_v0006,cchsflow_d00016,BMI_derived,"DerivedVar::[height, weight]","Func::bmi_fun","BMI calculated from height and weight"
How it works
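A minimal sketch of how the DerivedVar:: convention could be detected with base R. The helpers find_derived_vars and parse_derived_inputs are hypothetical names introduced for illustration; they are not MockData's identify_derived_vars(), and the height and weight ranges are made-up sample values.

```r
# Hypothetical helper (not part of MockData): list variables whose recStart
# follows the DerivedVar::[...] convention.
find_derived_vars <- function(variable_details) {
  is_derived <- grepl("^DerivedVar::", variable_details$recStart)
  unique(variable_details$variable[is_derived])
}

# Hypothetical helper: extract the source variables named inside the brackets.
parse_derived_inputs <- function(rec_start) {
  inner <- sub("^DerivedVar::\\[(.*)\\]$", "\\1", rec_start)
  trimws(strsplit(inner, ",", fixed = TRUE)[[1]])
}

# Illustrative metadata rows (ranges are invented for this sketch)
details <- data.frame(
  variable = c("height", "weight", "BMI_derived"),
  recStart = c("[1.4,2.1]", "[35,150]", "DerivedVar::[height, weight]"),
  recEnd   = c("copy", "copy", "Func::bmi_fun"),
  stringsAsFactors = FALSE
)

derived <- find_derived_vars(details)
inputs  <- parse_derived_inputs(details$recStart[details$variable == "BMI_derived"])
```

Here derived holds the names to skip during generation, and inputs lists the raw variables that must exist before the derivation function can run.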
Creating derived variables post-generation
After generating mock data, calculate derived variables using your custom functions:
# 1. Generate raw variables
mock_data <- create_mock_data(
databaseStart = "minimal-example",
variables = variables,
variable_details = variable_details,
n = 1000
)
# Result: Contains height, weight (but not BMI_derived)
# 2. Calculate derived variables
mock_data$BMI_derived <- bmi_fun(
height = mock_data$height,
weight = mock_data$weight
)
# 3. Verify
summary(mock_data$BMI_derived)
Example: Custom BMI calculation
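The body of bmi_fun is not shown in this guide. The following is a minimal sketch of what such a function might look like, assuming height is in metres and weight is in kilograms; the actual function in your harmonization package may use different units or missing-value handling.

```r
# Illustrative definition only; the real bmi_fun may differ.
# Assumes height in metres and weight in kilograms.
bmi_fun <- function(height, weight) {
  # Missing inputs and non-positive heights give NA rather than Inf or NaN
  ifelse(is.na(height) | is.na(weight) | height <= 0,
         NA_real_,
         weight / height^2)
}

bmi_example <- bmi_fun(height = c(1.75, 1.60, NA), weight = c(70, 55, 80))
```

Because the function is vectorized, it can be applied directly to whole columns of mock data, as in step 2 above.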
Benefits of derived variables
- Separation of concerns: Raw data generation vs. business logic
- Reusability: Same derivation logic for mock and real data
- Testing: Verify derivation functions with known inputs
- Documentation: Metadata explicitly documents dependencies
Unique identifiers (UIDs)
MockData uses unique identifiers to track variables and categories throughout the metadata lifecycle. UIDs provide traceability and version control for harmonization workflows.
UID structure
Variable-level UIDs (uid column in variables.csv):
uid,variable,variableShortLabel,rType
cchsflow_v0001,age,Age at interview,integer
cchsflow_v0002,smoking,Smoking status,factor
ices_v01,interview_date,Interview date,date
Detail-level UIDs (uid_detail column in variable_details.csv):
uid,uid_detail,variable,recStart,recEnd,catLabel
cchsflow_v0001,cchsflow_d00001,age,"[18,100]","copy","Valid age range"
cchsflow_v0001,cchsflow_d00002,age,"997","NA::b","Don't know"
cchsflow_v0002,cchsflow_d00005,smoking,"1","1","Never smoker"
cchsflow_v0002,cchsflow_d00006,smoking,"2","2","Former smoker"
UID naming conventions
Variable UIDs:
- Format: {project}_{version}{number}
- Examples: cchsflow_v0001, ices_v01, chmsflow_v0042
Detail UIDs:
- Format: {project}_d{number}
- Examples: cchsflow_d00001, ices_d003, chmsflow_d00156
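A quick way to catch malformed UIDs is a regular-expression check. The patterns below are inferred from the examples above and are an assumption, not a rule enforced by MockData; the deliberately malformed value cchsflow-v1 is included only to show a failure.

```r
# Patterns inferred from the examples above; adjust if your project differs.
variable_uid_pattern <- "^[a-z]+_v[0-9]+$"
detail_uid_pattern   <- "^[a-z]+_d[0-9]+$"

uids        <- c("cchsflow_v0001", "ices_v01", "chmsflow_v0042", "cchsflow-v1")
uid_details <- c("cchsflow_d00001", "ices_d003", "chmsflow_d00156")

# Collect any identifiers that do not follow the convention
bad_uids        <- uids[!grepl(variable_uid_pattern, uids)]
bad_uid_details <- uid_details[!grepl(detail_uid_pattern, uid_details)]
```

Running a check like this when metadata is loaded surfaces naming mistakes before they propagate into cross-references.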
Benefits of UIDs
- Traceability: Track variable definitions across metadata versions
- Version control: Identify when variables changed
- Cross-referencing: Link variables across databases
- Debugging: Quickly locate specific category rules
- Documentation: Permanent identifiers for scientific publications
Example: Tracking variable evolution
library(dplyr)
# Load current metadata
variables_v2 <- read.csv("variables_v2.csv")
# Load previous metadata
variables_v1 <- read.csv("variables_v1.csv")
# Find rows in v2 whose (uid, rType) pair has no match in v1
changed_vars <- anti_join(variables_v2, variables_v1, by = c("uid", "rType"))
# Result: Variables that are new in v2 or whose rType changed under the same UID
Multi-database workflows
MockData supports generating data for multiple databases/cycles using the databaseStart parameter. This enables testing harmonization code across survey cycles.
Database filtering
The databaseStart column in variable_details.csv specifies which databases each category applies to:
uid,uid_detail,variable,databaseStart,recStart,recEnd,catLabel
ices_v01,ices_d001,interview_date,minimal-example,"[2001-01-01,2005-12-31]","copy","Interview date range"
cchsflow_v0001,cchsflow_d00001,age,"cchs2001_p,cchs2005_p","[18,100]","copy","Valid age range"
Generating data for multiple databases
Single database:
# Generate for specific database
mock_cchs2001 <- create_mock_data(
databaseStart = "cchs2001_p",
variables = variables,
variable_details = variable_details,
n = 1000
)
Multiple databases:
# Generate for multiple cycles
databases <- c("cchs2001_p", "cchs2005_p", "cchs2009_p")
mock_data_list <- lapply(databases, function(db) {
create_mock_data(
databaseStart = db,
variables = variables,
variable_details = variable_details,
n = 1000,
seed = 123 # Same seed for consistency
)
})
names(mock_data_list) <- databases
Database-specific category rules
Different databases may have different category codes for the same variable:
uid,uid_detail,variable,databaseStart,recStart,recEnd,catLabel
cchsflow_v0002,cchsflow_d00005,smoking,cchs2001_p,"1","1","Never smoker"
cchsflow_v0002,cchsflow_d00006,smoking,cchs2001_p,"2","2","Former smoker"
cchsflow_v0002,cchsflow_d00007,smoking,cchs2005_p,"01","1","Never smoker"
cchsflow_v0002,cchsflow_d00008,smoking,cchs2005_p,"02","2","Former smoker"
MockData automatically filters to the correct rules based on databaseStart.
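The filtering step can be sketched as follows. This is an illustration of the idea, not MockData's internal code: filter_by_database is a hypothetical helper, and it assumes databaseStart holds a comma-separated list of database names, as in the examples above.

```r
# Hypothetical helper (not MockData's internal code): keep detail rows whose
# comma-separated databaseStart field lists the requested database.
filter_by_database <- function(variable_details, database) {
  db_lists <- strsplit(variable_details$databaseStart, ",", fixed = TRUE)
  keep <- vapply(db_lists, function(dbs) database %in% trimws(dbs), logical(1))
  variable_details[keep, , drop = FALSE]
}

# Illustrative rows based on the examples in this guide
details <- data.frame(
  uid_detail    = c("cchsflow_d00005", "cchsflow_d00007", "cchsflow_d00001"),
  variable      = c("smoking", "smoking", "age"),
  databaseStart = c("cchs2001_p", "cchs2005_p", "cchs2001_p,cchs2005_p"),
  recStart      = c("1", "01", "[18,100]"),
  stringsAsFactors = FALSE
)

rules_2005 <- filter_by_database(details, "cchs2005_p")
```

Splitting the field and matching whole names avoids the false positives a substring search would give when one database name is a prefix of another.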
Benefits of multi-database support
- Test harmonization: Verify code works across survey cycles
- Compare databases: Generate comparable mock datasets
- Version management: Track database-specific variations
- Batch generation: Create test data for entire project at once
Duplicate prevention: How df_mock works
Implementation
All generator functions check if a variable already exists before creating it:
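A minimal sketch of such a check, using a hypothetical generator named create_example_var. MockData's actual generators take more arguments and may implement the check differently.

```r
# Hypothetical generator illustrating the check; not MockData's source code.
create_example_var <- function(var, n, df_mock = NULL) {
  # Skip generation when the column already exists in df_mock
  if (!is.null(df_mock) && var %in% names(df_mock)) {
    return(NULL)
  }
  out <- data.frame(sample(1:3, n, replace = TRUE))
  names(out) <- var
  out
}

df <- data.frame(id = 1:5)
first  <- create_example_var("SMK_01", n = 5, df_mock = df)  # one-column data frame
df     <- cbind(df, first)
second <- create_example_var("SMK_01", n = 5, df_mock = df)  # NULL: already present
```

Returning NULL instead of raising an error lets the caller decide whether an existing column is a problem or simply a no-op.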
Why this matters
Without duplicate checking:
# Dangerous - creates duplicate columns
for (i in 1:3) {
df <- cbind(df, create_cat_var("SMK_01", ...))
}
# Result: df has SMK_01, SMK_01.1, SMK_01.2
With duplicate checking:
# Safe - only creates variable once
for (i in 1:3) {
col <- create_cat_var("SMK_01", ..., df_mock = df)
if (!is.null(col)) df <- cbind(df, col)
}
# Result: df has SMK_01 (created once, subsequent calls return NULL)
Design rationale
Current approach (explicit control):
- Pro: Explicit control over data frame construction
- Pro: NULL return signals “variable exists” (useful for debugging)
- Pro: Compatible with both standalone and batch generation modes
- Con: Requires the if (!is.null(col)) df <- cbind(df, col) pattern
Note: create_mock_data() handles this internally, so most users won’t need to worry about duplicate checking.
Integration with harmonization workflows
MockData is designed to work with the CCHS/CHMS harmonization ecosystem (cchsflow, chmsflow).
Typical workflow
- Metadata preparation: Use recodeflow metadata format (variables.csv, variable_details.csv)
- Mock data generation: Use MockData to create test datasets
- Harmonization development: Test harmonization code with mock data
- Validation: Verify harmonization logic before applying to real data
- Production: Apply harmonization to real CCHS/CHMS data
Example: Testing harmonization code
# 1. Generate mock raw data
mock_raw <- create_mock_data(
databaseStart = "cchs2001_p",
variables = variables,
variable_details = variable_details,
n = 1000,
seed = 123
)
# 2. Apply harmonization using rec_with_table()
# (rec_with_table() receives the database through database_name)
harmonized <- rec_with_table(
  data = mock_raw,
  variables = variables,
  database_name = "cchs2001_p",
  variable_details = variable_details
)
# 3. Validate harmonization logic
library(testthat)
test_that("harmonization handles NA codes correctly", {
# Check that missing codes (996-999) are converted to NA
expect_true(all(harmonized$smoking %in% c(1, 2, 3, NA)))
expect_false(any(harmonized$smoking %in% c(996, 997, 998, 999), na.rm = TRUE))
})
test_that("age range is valid", {
valid_ages <- harmonized$age[!is.na(harmonized$age)]
expect_true(all(valid_ages >= 18 & valid_ages <= 100))
})
Benefits of mock data for harmonization
- Faster development: No need to access restricted data for testing
- Reproducible testing: Same mock data every time (use seed parameter)
- Edge case testing: Easy to create extreme scenarios (prop_garbage parameter)
- Documentation: Mock data examples clarify harmonization logic
- CI/CD integration: Automated testing without data access restrictions
Performance considerations
For large-scale mock data generation:
Optimization strategies
1. Generate in batches:
# Instead of one large generation
result <- create_con_var(..., n = 1000000)
# Generate in batches
batch_size <- 100000
batches <- ceiling(1000000 / batch_size)
result_list <- lapply(1:batches, function(i) {
create_con_var(
...,
n = batch_size,
df_mock = data.frame(id = ((i-1)*batch_size + 1):(i*batch_size))
)
})
result <- dplyr::bind_rows(result_list)
2. Simplify distributions:
# Uniform is faster than normal (for continuous variables)
distribution = "uniform" # Faster
distribution = "normal" # Slower (normal distribution centered at range midpoint)
3. Minimize metadata:
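One reading of "minimize metadata" is to subset both metadata tables to the variables a test actually needs before passing them to create_mock_data(). That interpretation is an assumption, and the small tables below are invented for illustration.

```r
# Illustrative metadata; column names follow the examples in this guide.
variables <- data.frame(
  variable = c("age", "smoking", "height", "weight"),
  rType    = c("integer", "factor", "double", "double"),
  stringsAsFactors = FALSE
)
variable_details <- data.frame(
  variable = c("age", "age", "smoking", "smoking", "height", "weight"),
  recStart = c("[18,100]", "997", "1", "2", "[1.4,2.1]", "[35,150]"),
  stringsAsFactors = FALSE
)

# Keep only the variables the test actually needs
needed <- c("age", "smoking")
variables_small <- variables[variables$variable %in% needed, , drop = FALSE]
details_small   <- variable_details[variable_details$variable %in% needed, , drop = FALSE]
```

The reduced tables can then be supplied as the variables and variable_details arguments, so less metadata has to be processed per run.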
Current limitations
- Large datasets (>1M rows) may be slow
- Complex metadata with many variables requires more processing
- Normal distributions slower than uniform for continuous variables
Best practices
Seed management for reproducibility
Use different seeds for different variables to ensure independence while maintaining reproducibility:
# Generate multiple date variables with different seeds
birth_dates <- create_date_var(
var = "birth_date",
databaseStart = "minimal-example",
variables = variables,
variable_details = variable_details,
n = 1000,
seed = 100
)
death_dates <- create_date_var(
var = "death_date",
databaseStart = "minimal-example",
variables = variables,
variable_details = variable_details,
n = 1000,
seed = 101 # Different seed ensures independence
)
diagnosis_dates <- create_date_var(
var = "diagnosis_date",
databaseStart = "minimal-example",
variables = variables,
variable_details = variable_details,
n = 1000,
seed = 102 # Different seed
)
Why use different seeds:
- Gives each variable its own random stream, so its values are not tied to another variable's through a shared seed
- Prevents unwanted correlations between variables
- Maintains reproducibility (same seed = same data every time)
Recommended seed strategy:
# Use sequential seeds starting from a base value
base_seed <- 1000
mock_data <- create_mock_data(
databaseStart = "minimal-example",
variables = variables,
variable_details = variable_details,
n = 5000,
seed = base_seed
)
# For individual variable generation:
# var1: seed = base_seed + 0 (1000)
# var2: seed = base_seed + 1 (1001)
# var3: seed = base_seed + 2 (1002)
Document your seeds:
Always document the seed values used for data generation in your code comments or project documentation. This ensures others can reproduce your exact mock datasets.
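One lightweight way to document seeds is to build a small lookup table from the base seed and keep it with the project. This is a sketch; the file name in the comment is a hypothetical example.

```r
base_seed <- 1000
vars <- c("birth_date", "death_date", "diagnosis_date")

# One seed per variable, offset from the base seed
seed_table <- data.frame(
  variable = vars,
  seed     = base_seed + seq_along(vars) - 1L,
  stringsAsFactors = FALSE
)

# Store alongside the generated data, for example (hypothetical file name):
# write.csv(seed_table, "mock_data_seeds.csv", row.names = FALSE)
```

Keeping the table under version control means a change to any seed shows up in the project history.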
Troubleshooting
Common issues
Issue: “Variable not found in metadata”
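A plausible first check, assuming the message means the requested name is absent from one of the metadata tables. The metadata below is invented for illustration, with smokng as a deliberate typo.

```r
# Illustrative metadata
variables <- data.frame(variable = c("age", "smoking"), stringsAsFactors = FALSE)
variable_details <- data.frame(
  variable      = c("age", "age"),
  databaseStart = c("cchs2001_p", "cchs2001_p"),
  stringsAsFactors = FALSE
)

requested <- c("age", "smoking", "smokng")

# Names absent from variables.csv (often typos)
missing_in_variables <- setdiff(requested, variables$variable)
# Names with no rows in variable_details.csv
missing_in_details <- setdiff(requested, variable_details$variable)
```

A name that appears in only one of the two results points to a table that is out of sync rather than a typo in the request.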
Issue: “No valid categories found”
# Check recStart values (uses dplyr for %>%, filter() and select())
library(dplyr)
variable_details %>% filter(variable == "problem_var") %>% select(recStart, recEnd)
# Ensure not all rules are filtered (copy, else)
Issue: “prop_NA doesn’t work”
# Verify NA codes exist in metadata
na_codes <- get_variable_categories(variable_details, include_na = TRUE)
If na_codes is empty, no NA codes are available in the metadata. Add NA codes (typically 996-999) to variable_details with appropriate recStart/recEnd values.
Getting help
- Check function documentation: ?create_cat_var, ?create_con_var, ?create_date_var, ?create_mock_data
- Review Getting started for basic concepts
- See the Configuration reference for the complete metadata schema
- Understand missing data handling in health surveys
- See MockData for recodeflow users for harmonization workflows
- Open an issue on GitHub with a reproducible example
Next steps
Tutorials:
- Getting started - Learn MockData basics
- Working with date variables - Date generation and interval notation
- Handling missing data - Missing codes and proportions
- Testing data quality and validation - Generating garbage data for QA
- Generating survival data with competing risks - Time-to-event data
Reference:
- Configuration reference - Complete metadata schema documentation
- MockData for recodeflow users - Integration with cchsflow/chmsflow
Contributing:
- Apply these concepts to your harmonization projects
- Contribute improvements to MockData on GitHub