About this vignette
This introductory tutorial teaches core MockData concepts through progressive examples. All code runs during vignette build to ensure accuracy. The generated data are for testing and development only—not for modelling or analysis.
What is MockData?
MockData generates metadata-driven mock datasets for testing and developing harmonisation workflows. Mock data are created solely from variable specifications and contain no real person-level data or identifiable information.
Key purposes
- Testing harmonisation code (cchsflow, chmsflow) without access to real survey data
- Developing data pipelines with realistic variable structures before data access
- Training and education with representative but non-sensitive data
- Validating data processing workflows with controlled test inputs
What mock data are (and are not)
MockData reads recodeflow metadata files (variables.csv, variable-details.csv) to generate data that mimics the variable structure of health survey datasets like CCHS and CHMS. The data have appropriate types, value ranges, and category labels—but no real-world statistical relationships.
Limitations: While variable types and ranges match the metadata, joint distributions and correlations may differ significantly from real-world data. Mock data should never be used for population inference, epidemiological modelling, or research publication.
Your first mock dataset
This tutorial walks you through generating a simple mock dataset with both categorical and continuous variables.
Setup
Step 1: Prepare metadata
MockData uses two metadata tables:
- variables: defines which variables exist in each database cycle
- variable_details: defines categories, ranges, and recode rules
For this tutorial, we’ll use a simple example with two variables: smoking status (categorical) and age (continuous) .
# Define variables table
variables <- data.frame(
variable = c("smoking", "smoking", "age", "age"),
variableStart = c("SMK_01", "SMK_01", "AGE_01", "AGE_01"),
databaseStart = c("cycle1", "cycle2", "cycle1", "cycle2"),
databaseEnd = c("cycle1", "cycle2", "cycle1", "cycle2"),
variableType = c("categorical", "categorical", "continuous", "continuous")
)
# Define variable details (categories and ranges)
variable_details <- data.frame(
variable = c("smoking", "smoking", "smoking",
"smoking", "smoking", "smoking", "smoking",
"age", "age", "age"),
recStart = c("1", "2", "3", "996", "997", "998", "999",
"[18, 100]", "996", "[997, 999]"),
recEnd = c("1", "2", "3", "996", "997", "998", "999",
"copy", "NA::a", "NA::b"),
catLabel = c("Daily smoker", "Occasional smoker", "Never smoked",
"Not applicable", "Don't know", "Refusal", "Not stated",
"Age in years", "Not applicable", "Missing"),
variableStart = c("SMK_01", "SMK_01", "SMK_01", "SMK_01", "SMK_01", "SMK_01", "SMK_01",
"AGE_01", "AGE_01", "AGE_01"),
databaseStart = c("cycle1", "cycle1", "cycle1", "cycle1", "cycle1", "cycle1", "cycle1",
"cycle1", "cycle1", "cycle1"),
rType = c("factor", "factor", "factor", "factor", "factor", "factor", "factor",
"integer", "integer", "integer")
)Variables table (4 rows):
| variable | variableStart | databaseStart | databaseEnd | variableType |
|---|---|---|---|---|
| smoking | SMK_01 | cycle1 | cycle1 | categorical |
| smoking | SMK_01 | cycle2 | cycle2 | categorical |
| age | AGE_01 | cycle1 | cycle1 | continuous |
| age | AGE_01 | cycle2 | cycle2 | continuous |
Variable details table (10 rows):
| variable | recStart | recEnd | catLabel | variableStart | databaseStart | rType |
|---|---|---|---|---|---|---|
| smoking | 1 | 1 | Daily smoker | SMK_01 | cycle1 | factor |
| smoking | 2 | 2 | Occasional smoker | SMK_01 | cycle1 | factor |
| smoking | 3 | 3 | Never smoked | SMK_01 | cycle1 | factor |
| smoking | 996 | 996 | Not applicable | SMK_01 | cycle1 | factor |
| smoking | 997 | 997 | Don’t know | SMK_01 | cycle1 | factor |
| smoking | 998 | 998 | Refusal | SMK_01 | cycle1 | factor |
| smoking | 999 | 999 | Not stated | SMK_01 | cycle1 | factor |
| age | [18, 100] | copy | Age in years | AGE_01 | cycle1 | integer |
| age | 996 | NA::a | Not applicable | AGE_01 | cycle1 | integer |
| age | [997, 999] | NA::b | Missing | AGE_01 | cycle1 | integer |
Step 2: Generate a categorical variable with custom proportions
Real survey data has different types of missing values. Use the proportions parameter to explicitly specify the distribution for all categories, including missing codes:
# Create mock data frame
df_mock <- data.frame(id = 1:1000)
# Generate smoking variable with explicit proportions for ALL categories
smoking_col <- create_cat_var(
var_raw = "SMK_01",
cycle = "cycle1",
variable_details = variable_details,
variables = variables,
length = 1000,
df_mock = df_mock,
proportions = list(
"1" = 0.30, # Daily smoker
"2" = 0.50, # Occasional smoker
"3" = 0.15, # Never smoked
"996" = 0.01, # Not applicable (valid skip)
"997" = 0.01, # Don't know
"998" = 0.02, # Refusal
"999" = 0.01 # Not stated
)
)
# Add to data frame
df_mock <- cbind(df_mock, smoking_col)
# View distribution
table(df_mock$SMK_01)
1 2 3 996 997 998 999
312 472 156 7 13 32 8
What happened:
- MockData extracted all 7 categories from variable_details (1, 2, 3, 996, 997, 998, 999)
- Generated 1000 random values distributed according to the specified proportions
- The
proportionsparameter gives you full control over the distribution, including missing data codes - Categories 1-3 are valid responses, while 996-999 are different types of missing data
For further discussion on making mock missing data see Missing data in health surveys.
Step 3: Generate continuous variable
Use create_con_var() to generate continuous variables like age.
# Generate age variable
age_col <- create_con_var(
var_raw = "AGE_01",
cycle = "cycle1",
variable_details = variable_details,
variables = variables,
length = 100,
df_mock = df_mock,
distribution = "uniform" # Uniform distribution within range [18, 100]
)
# Add to data frame
df_mock <- cbind(df_mock, age_col)
# View results
head(df_mock, 10) id SMK_01 AGE_01
1 1 3 38
2 2 1 77
3 3 2 52
4 4 2 40
5 5 2 94
6 6 1 38
7 7 1 59
8 8 3 59
9 9 2 70
10 10 2 69
summary(df_mock$AGE_01) Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 40.00 63.50 60.93 79.25 99.00
What happened:
- MockData extracted the range from variable_details [18, 100]
- Generated 100 random ages uniformly distributed between 18 and 100
- Returned a single-column data frame that we added to df_mock
Step 4: Working with configuration files
For larger projects, MockData supports batch generation using configuration CSV files instead of inline data frames. This makes it easier to generate many variables at once.
# Read configuration files (not run in this tutorial)
config <- read_mock_data_config("mock_data_config.csv")
details <- read_mock_data_config_details("mock_data_config_details.csv")
# Generate all variables in one call
mock_data <- create_mock_data(
config = config,
details = details,
n = 1000,
seed = 123
)Why use config files:
- Generate dozens of variables in a single call
- Easier to maintain and version control metadata
- Consistent with recodeflow harmonization workflows
- Supports advanced features like derived variables and garbage data
See CCHS example, CHMS example, and DemPoRT example for real-world configuration file usage.
Step 5: Control reproducibility with seeds
Use seeds to generate the same mock data every time.
# Set seed for reproducibility
set.seed(12345)
df_mock <- data.frame(id = 1:100)
result1 <- create_cat_var(
var_raw = "SMK_01",
cycle = "cycle1",
variable_details = variable_details,
variables = variables,
length = 100,
df_mock = df_mock
)
# Reset seed
set.seed(12345)
df_mock <- data.frame(id = 1:100)
result2 <- create_cat_var(
var_raw = "SMK_01",
cycle = "cycle1",
variable_details = variable_details,
variables = variables,
length = 100,
df_mock = df_mock
)
# Verify identical
identical(result1$SMK_01, result2$SMK_01)[1] TRUE
Result: TRUE - same seed produces identical mock data
Step 6: Working with derived variables
MockData generates raw variables (direct survey measurements). Derived variables should be calculated from the generated data using harmonization workflows.
Conceptual workflow:
# 1. Generate mock raw variables
mock_data <- create_mock_data(
config = config, # Includes height_raw, weight_raw
details = details,
n = 1000
)
# 2. Apply harmonization to create derived variables
# (Requires cchsflow or recodeflow package)
# library(cchsflow)
# mock_data <- rec_with_table(
# data = mock_data,
# variables = variables,
# variable_details = variable_details,
# database_name = "cchs2001"
# )
# Now mock_data includes derived variables like BMI_der, age categories, etc.Why this approach:
- Mirrors real data processing (derived variables computed during harmonization)
- Allows testing harmonization logic with mock data
- Keeps raw and derived variables separate
- Tests the complete workflow: generate → harmonize → analyze
Common derived variables:
- BMI categories from height and weight
- Age categories from continuous age
- Income quintiles from income
- Health risk scores from multiple indicators
See CCHS example and DemPoRT example for complete workflows with derived variables.
What you learned
In this tutorial, you learned:
- How to prepare metadata (variables and variable_details tables)
- How to specify custom proportions for all categories including missing codes
- The critical difference between valid skip (996) and other missing codes (997-999)
- How to calculate prevalence correctly by handling missing codes appropriately
- How to generate continuous variables with
create_con_var() - How configuration files enable batch generation for larger projects
- How to ensure reproducibility with seeds
- How to work with derived variables through harmonization workflows
Next steps
Core topics:
- Missing data - Realistic missing data patterns in health surveys
- Date variables - Working with dates and survival times
- Configuration files - Batch generation approach
Database-specific examples:
- CCHS example - Canadian Community Health Survey
- CHMS example - Canadian Health Measures Survey
- DemPoRT example - Dementia Population Risk Tool
Advanced topics:
- Garbage data - Simulating data quality issues
- Advanced topics - Technical details and performance