About this vignette: This tutorial is for users who already have recodeflow metadata files (variables.csv and variable_details.csv). If you’re new to MockData, start with “Getting started” instead.

Quick start

If you already have recodeflow metadata files, you’re one function call away from generating mock data:

library(MockData)

# Point MockData at your existing metadata files
mock_data <- create_mock_data(
  databaseStart = "cchs2001_p",  # Your database identifier
  variables = "path/to/variables.csv",
  variable_details = "path/to/variable_details.csv",
  n = 1000,
  seed = 123
)

head(mock_data)

That’s it. MockData reads your existing metadata and generates mock data that matches your variable specifications.

What MockData reads from your files

MockData uses the same metadata files as cchsflow and chmsflow. No duplicate specifications needed.

From variables.csv

MockData reads:

  • variable: Variable name in the generated dataset
  • rType: R data type (factor, character, integer, double, date)
  • role: Comma-separated roles; only variables whose roles include enabled are generated
  • distribution: Distribution type (uniform, normal, exponential, gompertz) for continuous/date variables
  • mean, sd, rate, shape: Distribution parameters
  • event_prop, followup_min, followup_max: Survival data parameters
  • prop_garbage, garbage_low_prop, garbage_high_prop: Data quality testing parameters

Example:

variable  rType    role                      distribution
age       integer  enabled,predictor,table1  normal
smoking   factor   enabled,predictor,table1
BMI       double   enabled,outcome,table1    normal
height    double   enabled,predictor         normal
weight    double   enabled,predictor         normal

From variable_details.csv

MockData reads:

  • recStart: Category codes or value ranges using interval notation (e.g., [18,100] for continuous, [2001-01-01,2005-12-31] for dates)
  • recEnd: Recoded value or classification (a category code, copy to pass values through, NA::a for not applicable, NA::b for missing)
  • catLabel: Category labels
  • proportion: Category proportions (optional)

Example:

variable  recStart  recEnd  catLabel        proportion
smoking   1         1       Never smoker    0.50
smoking   2         2       Former smoker   0.30
smoking   3         3       Current smoker  0.17
smoking   7         NA::b   Don’t know      0.03
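
To make the interval notation in recStart concrete, here is a minimal base R sketch of how a range such as [18,100] or [2001-01-01,2005-12-31] could be split into bounds and used to draw values. The helper parse_interval() is hypothetical and written only for this illustration; it is not a MockData function, and MockData's own parsing may differ.

```r
# Hypothetical helper (not part of MockData): split "[lower,upper]" into bounds.
parse_interval <- function(x) {
  x <- trimws(x)
  inner <- substr(x, 2, nchar(x) - 1)  # drop the enclosing brackets
  parts <- trimws(strsplit(inner, ",", fixed = TRUE)[[1]])
  list(lower = parts[1], upper = parts[2])
}

# Continuous range: convert the bounds to numbers
age_range <- parse_interval("[18,100]")
age_bounds <- as.numeric(c(age_range$lower, age_range$upper))

# Date range: convert the bounds to Date objects
date_range <- parse_interval("[2001-01-01,2005-12-31]")
date_bounds <- as.Date(c(date_range$lower, date_range$upper))

set.seed(1)

# Draw a few values inside each range
ages <- round(runif(5, min = age_bounds[1], max = age_bounds[2]))
dates <- sample(
  seq(date_bounds[1], date_bounds[2], by = "day"),
  size = 5,
  replace = TRUE
)

ages
dates
```

Every drawn value stays inside the bounds given in recStart, which is the property the generated mock data relies on.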

The databaseStart parameter

The databaseStart parameter tells MockData which database/cycle to generate data for. This is the same identifier you use in recodeflow workflows.

# Generate data for CCHS 2001
mock_cchs2001 <- create_mock_data(
  databaseStart = "cchs2001_p",  # Match your database identifier
  variables = "variables.csv",
  variable_details = "variable_details.csv",
  n = 1000
)

# Generate data for CHMS Cycle 1
mock_chms1 <- create_mock_data(
  databaseStart = "cycle1",  # Match your database identifier
  variables = "variables.csv",
  variable_details = "variable_details.csv",
  n = 1000
)

MockData filters variable_details.csv by its databaseStart column and generates only the variables defined for the database you specify.
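
The filtering step can be pictured with a small base R sketch. It assumes that a databaseStart cell may list several identifiers separated by commas, as is common in recodeflow metadata; the helper matches_database() and the toy table are hypothetical and do not reproduce MockData's internal code.

```r
# Toy variable_details table with only the columns needed for the illustration
variable_details <- data.frame(
  variable = c("age", "age", "smoking"),
  databaseStart = c("cchs2001_p", "cycle1", "cchs2001_p, cchs2003_p"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of MockData): TRUE for rows whose
# databaseStart cell lists `database` among its comma-separated identifiers.
matches_database <- function(database_start, database) {
  vapply(
    strsplit(database_start, ",", fixed = TRUE),
    function(ids) database %in% trimws(ids),
    logical(1)
  )
}

keep <- matches_database(variable_details$databaseStart, "cchs2001_p")
selected <- variable_details[keep, ]

# Variables that would be generated for this database
unique(selected$variable)
```

Matching whole identifiers after splitting, rather than searching for a substring, avoids accidentally matching a longer identifier that merely contains the requested one.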

Working example with minimal-example metadata

Let’s generate mock data using the minimal-example metadata included with MockData:

# Load recodeflow-compatible metadata
variables <- read.csv(
  system.file("extdata/minimal-example/variables.csv", package = "MockData"),
  stringsAsFactors = FALSE,
  check.names = FALSE
)

variable_details <- read.csv(
  system.file("extdata/minimal-example/variable_details.csv", package = "MockData"),
  stringsAsFactors = FALSE,
  check.names = FALSE
)

# Generate mock data
mock_data <- create_mock_data(
  databaseStart = "minimal-example",
  variables = variables,
  variable_details = variable_details,
  n = 100,
  seed = 456
)

# View structure
cat("Generated", nrow(mock_data), "observations across", ncol(mock_data), "variables\n\n")
Generated 100 observations across 5 variables

View sample data:

  age smoking    height    weight interview_date
1  30       1 0.5259396  52.05418     2004-08-20
2  59       1 1.2789355 104.37787     2001-12-09
3  62       2 0.2535377  87.96214     2003-07-30
4  29       1 0.3669026  65.09685     2002-03-22
5  39       2 0.6075947  89.62049     2004-03-26
6  45       1 1.2702433  88.92887     2005-07-24

Common workflows

Testing harmonisation code

Use mock data to test cchsflow or chmsflow harmonisation before accessing real data:

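library(cchsflow)
library(dplyr)  # provides %>%, filter() and summarise()

# Load the metadata once so both steps use the same tables
variables <- read.csv("variables.csv", stringsAsFactors = FALSE)
variable_details <- read.csv("variable_details.csv", stringsAsFactors = FALSE)

# 1. Generate mock raw data
mock_raw <- create_mock_data(
  databaseStart = "cchs2001_p",
  variables = variables,
  variable_details = variable_details,
  n = 1000
)

# 2. Apply harmonisation (rec_with_table() calls this argument database_name)
mock_harmonised <- rec_with_table(
  data = mock_raw,
  variables = variables,
  variable_details = variable_details,
  database_name = "cchs2001_p"
)

# 3. Test your analysis code
mock_harmonised %>%
  filter(age_der >= 65) %>%
  summarise(prevalence_bmi_obesity = mean(bmi_der_cat == "Obese", na.rm = TRUE))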

Developing analysis scripts

Write and debug analysis scripts before data access:

library(dplyr)
library(ggplot2)  # needed for the plot at the end

# Generate mock data
mock_data <- create_mock_data(
  databaseStart = "cchs2001_p",
  variables = "variables.csv",
  variable_details = "variable_details.csv",
  n = 5000,
  seed = 789
)

# Develop analysis pipeline
results <- mock_data %>%
  filter(!is.na(age), !is.na(smoking)) %>%
  group_by(smoking) %>%
  summarise(
    n = n(),
    mean_age = mean(age),
    sd_age = sd(age)
  )

# Test visualisations
ggplot(mock_data, aes(x = age, fill = smoking)) +
  geom_density(alpha = 0.5) +
  labs(title = "Age distribution by smoking status (MOCK DATA)")

Training new team members

Generate safe, non-sensitive data for training:

# Generate training dataset
training_data <- create_mock_data(
  databaseStart = "cchs2001_p",
  variables = "variables.csv",
  variable_details = "variable_details.csv",
  n = 500,
  seed = 111
)

# Save for training exercises
write.csv(training_data, "training_cchs_mock.csv", row.names = FALSE)

Advanced features

Specifying category proportions

Add a proportion column to variable_details.csv to control category distributions:

variable,recStart,recEnd,catLabel,proportion
smoking,1,1,Never,0.50
smoking,2,2,Former,0.30
smoking,3,3,Current,0.20

Without proportions, MockData generates equal probabilities for all categories.
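
The effect of the proportion column can be illustrated with base R's sample(). This is only a sketch of the idea, using the three smoking categories above; it does not show how MockData implements category sampling internally.

```r
categories <- c("1", "2", "3")       # recStart codes for smoking
proportions <- c(0.50, 0.30, 0.20)   # values from the proportion column

set.seed(42)
n <- 10000

# With proportions: categories are drawn with the specified probabilities
with_props <- sample(categories, n, replace = TRUE, prob = proportions)

# Without proportions: every category is equally likely
equal_props <- sample(categories, n, replace = TRUE)

# Compare the observed shares
round(prop.table(table(with_props)), 2)
round(prop.table(table(equal_props)), 2)
```

With a large n, the observed shares in the first table settle near 0.50, 0.30 and 0.20, while those in the second settle near one third each.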

Survival data and custom distributions

Advanced features are specified directly in variables.csv using additional columns:

Survival data parameters:

  • event_prop: Probability that the event occurs (0 to 1)
  • followup_min, followup_max: Follow-up time range in days
  • distribution: Distribution type (uniform, gompertz, exponential)
  • rate, shape: Distribution parameters

Example:

uid,variable,rType,role,distribution,rate,shape,event_prop,followup_min,followup_max
ices_v02,primary_event_date,date,enabled,gompertz,0.0001,0.1,0.10,0,5475
ices_v03,death_date,date,enabled,gompertz,0.0001,0.1,0.20,365,7300

See “Generating survival data with competing risks” for details.
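
As a rough illustration of how these parameters might combine, the sketch below draws Gompertz event times by inverse-transform sampling and marks a share of records as having the event. Several things here are assumptions rather than documented MockData behaviour: the hazard parametrisation h(t) = rate * exp(shape * t), the clamping of times to the follow-up window, the index date, and the helper r_gompertz(), which is hypothetical.

```r
# Hypothetical helper (not part of MockData): Gompertz draws by inverting
# S(t) = exp(-(rate / shape) * (exp(shape * t) - 1)).
r_gompertz <- function(n, rate, shape) {
  u <- runif(n)
  log(1 - (shape / rate) * log(u)) / shape
}

set.seed(2024)
n <- 1000
event_prop <- 0.10      # values from the primary_event_date row above
followup_min <- 0
followup_max <- 5475

# Decide which records experience the event
has_event <- runif(n) < event_prop

# Draw event times in days and keep them inside the follow-up window
time_to_event <- r_gompertz(n, rate = 0.0001, shape = 0.1)
time_to_event <- pmin(pmax(time_to_event, followup_min), followup_max)

# Convert to dates relative to an assumed index date; no event means NA
index_date <- as.Date("2001-01-01")
event_date <- rep(as.Date(NA), n)
event_date[has_event] <- index_date + round(time_to_event[has_event])

mean(has_event)
summary(time_to_event)
```

Records without an event keep a missing date, which mirrors how event dates are usually stored in survival datasets.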

Data quality testing

Add garbage data for testing validation pipelines using these variables.csv columns:

  • prop_garbage: Simple auto-generated garbage proportion
  • garbage_low_prop, garbage_low_range: Below-range invalid values
  • garbage_high_prop, garbage_high_range: Above-range invalid values

See “Testing data quality and validation” for details.
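
The idea behind the low and high garbage parameters can be sketched in base R: replace a chosen share of valid values with values outside the valid range, then confirm that a validation rule catches them. The proportions and ranges below are illustrative assumptions, and the code does not reproduce MockData's own garbage generation.

```r
set.seed(7)
n <- 1000

# Valid ages between 18 and 100
age <- round(runif(n, min = 18, max = 100))

# Illustrative settings, in the spirit of garbage_low_* and garbage_high_*
low_prop <- 0.02
low_range <- c(-10, 0)
high_prop <- 0.03
high_range <- c(150, 200)

n_low <- round(n * low_prop)
n_high <- round(n * high_prop)

# Pick distinct rows so low and high garbage never overwrite each other
bad_rows <- sample(n, n_low + n_high)
low_rows <- bad_rows[seq_len(n_low)]
high_rows <- bad_rows[n_low + seq_len(n_high)]

age_dirty <- age
age_dirty[low_rows] <- round(runif(n_low, low_range[1], low_range[2]))
age_dirty[high_rows] <- round(runif(n_high, high_range[1], high_range[2]))

# A validation rule should flag exactly the injected values
invalid <- age_dirty < 18 | age_dirty > 100
sum(invalid)
```

Because the number of injected values is known in advance, a validation pipeline can be checked against an exact expected count.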

Differences from real data

Important limitations:

MockData generates data that matches your metadata specifications but does not preserve real-world statistical relationships:

  • Variables are generated independently
  • No correlations between variables (e.g., age and health status)
  • Joint distributions may differ from actual survey data
  • Temporal patterns are simplified
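
The consequence of independent generation is easy to see in a small base R sketch. The height and weight parameters below are made up for the illustration and are not taken from any metadata file.

```r
set.seed(99)
n <- 10000

# Two variables drawn independently, as mock variables are
height <- rnorm(n, mean = 1.70, sd = 0.10)  # metres (illustrative values)
weight <- rnorm(n, mean = 75, sd = 15)      # kilograms (illustrative values)

# In real survey data taller people tend to weigh more;
# with independent draws the correlation is close to zero
height_weight_cor <- cor(height, weight)
height_weight_cor
```

Any analysis that depends on such relationships, for example a regression of weight on height, will run on mock data but its estimates carry no meaning.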

Never use mock data for:

  • Research publications
  • Population inference
  • Predictive modelling
  • Algorithm training

Safe uses:

  • Testing harmonisation workflows
  • Developing analysis scripts
  • Training team members
  • Creating documentation examples
