Generate realistic test data from recodeflow variable configuration files (variables.csv and variable-details.csv).
30-second example
library(MockData)
# Generate mock data from metadata files
mock_data <- create_mock_data(
databaseStart = "minimal-example",
variables = system.file("extdata/minimal-example/variables.csv", package = "MockData"),
variable_details = system.file("extdata/minimal-example/variable_details.csv", package = "MockData"),
n = 100,
seed = 123
)
head(mock_data)
#> age smoking interview_date
#> 1 42 2 2001-08-05
#> 2 47 2 2002-11-27
#> 3 73 3 2002-11-14
#> 4 51 1 2005-01-14
#> 5 52 1 2001-02-25
#> 6 76 3 2004-07-25What’s in those CSV files? See inst/extdata/minimal-example/ - just variable names, types, and ranges. Total 9 rows across 2 files.
Why use variable metadata files?
Two paths:
-
Already using recodeflow? Use MockData with your existing
variables.csvandvariable_details.csvfiles. No duplicate specifications needed. - New to metadata? Upfront cost to create configuration files, but you get reproducible data pipelines and documentation that stays in sync with your code.
What is mock data?
In this package, “mock data” refers to metadata-driven simulated data created solely for software testing and workflow validation. Mock data are generated from variable specifications and contain no real person-level data or identifiable information.
Key distinctions:
- Mock data (this package): Generated from metadata only. Mimics variable structure and ranges but not real-world statistical relationships. Used for testing pipelines, not analysis.
- Synthetic data: Preserves statistical properties and relationships from real datasets through generative models. May be used for research when properly validated.
- Dummy data: Placeholder or minimal test data, often hardcoded or randomly generated without metadata constraints.
MockData creates data that looks realistic (appropriate variable types, value ranges, category labels, tagged NAs) but has no relationship to any actual population. Joint distributions and correlations may differ significantly from real-world data.
Use cases
Appropriate uses:
- Workflow testing and data pipeline validation
- Data harmonisation logic checks (cchsflow, chmsflow)
- Developing analysis scripts before data access
- Creating reproducible examples for documentation
- Training new analysts on survey data structure
Not appropriate for:
- Population inference or epidemiological modelling
- Predictive algorithm training
- Statistical analysis or research publication
- Any use requiring realistic joint distributions or correlations
Privacy and ethics
Generated mock data contain no personal information or individual-level identifiers. All data are created synthetically from metadata specifications, ensuring negligible risk of re-identification. This approach supports responsible, ethical, and reproducible public-health software development.
Features
-
Metadata-driven: Uses existing
variables.csvandvariable_details.csvfrom recodeflow projects - Universal: Works across CHMS, CCHS, and future recodeflow projects
- Recodeflow-standard: Supports all recodeflow notation formats (database-prefixed, bracket, mixed)
- Data quality testing: Generate invalid/out-of-range values to test validation pipelines
- Validation: Tools to check metadata quality
Installation
# Install from local directory
devtools::install_local("~/github/mock-data")
# Or install from GitHub (once published)
# devtools::install_github("Big-Life-Lab/MockData")Note: Package vignettes are in Quarto format (.qmd). To build vignettes locally, you need Quarto installed.
Next steps
Tutorials:
- Getting started - Complete tutorial from single variables to full datasets
- For recodeflow users - Using MockData with existing metadata
- Survival data - Time-to-event data and temporal patterns
Examples:
- Minimal example - Simplest possible metadata configuration
- CCHS example - CCHS workflow demonstration
- CHMS example - CHMS workflow demonstration
Configuration architecture
MockData uses a three-file architecture that separates project data dictionaries, study specifications, and MockData-specific parameters:
File structure
-
Project data dictionary (
variables.csv)- Variable names, labels, types, role flags
- Shared across analysis projects
- Example:
variable, variableType, label
-
Study specifications (
variable_details.csv)- Study date ranges, follow-up periods
- Category definitions and value ranges
- Transformation rules (recStart, recEnd, copy, catLabel)
- Example:
uvariable, recStart, catLabel
-
MockData-specific parameters (
mock_config.csv, optional)- Proportions of variable categories
- Event occurrence probabilities (
event_occurs) - Distribution parameters (
distribution) - Garbage data proportions (
prop_invalid) - Advanced features (survival data, data quality testing)
This separation allows MockData to read existing recodeflow metadata files without modification, while supporting optional MockData-specific configurations when needed.
Contributing
This package is part of the recodeflow ecosystem. See CONTRIBUTING.md for details.
License
MIT License - see LICENSE file for details.
The recodeflow universe
MockData is part of the recodeflow universe — a metadata-driven approach to variable recoding and harmonisation. The core philosophy is to define variable transformations once in metadata files, then reuse those definitions for harmonisation, documentation, and mock data generation.
Design principles:
- Metadata-driven: Variable definitions and recode rules live in structured metadata (CSV files)
- Reusable: Same metadata drives harmonisation code, documentation, and testing data
- Survey-focused: Built for health surveys (CCHS, CHMS) but applicable to any categorical/continuous data
- Open and reproducible: Transparent recode logic that anyone can inspect and verify
Related packages:
- cchsflow: Harmonisation workflows for Canadian Community Health Survey (CCHS)
- chmsflow: Harmonisation workflows for Canadian Health Measures Survey (CHMS)
- recodeflow: Core metadata specifications and utilities
Data sources and acknowledgements
The example metadata in this package is derived from:
- Canadian Community Health Survey (CCHS) — Statistics Canada
- Canadian Health Measures Survey (CHMS) — Statistics Canada
Statistics Canada Open License:
The use of CCHS and CHMS metadata examples in this package falls under Statistics Canada’s Open License, which permits use, reproduction, and distribution of Statistics Canada data products. We acknowledge Statistics Canada as the source of the survey designs and variable definitions that informed our example metadata files.
Important: This package generates mock data only. It does not contain, distribute, or provide access to any actual Statistics Canada microdata. Real CCHS and CHMS data are available through Statistics Canada’s Research Data Centres (RDCs) and Public Use Microdata Files (PUMFs) under appropriate data access agreements.
For more information: Statistics Canada Open License
Development environment setup
This package uses renv for reproducible package development environments.
R version compatibility
- Supported: R 4.3.x - 4.4.x
- Lockfile baseline: R 4.4.2 (institutional environments typically run 1-2 versions behind current)
- The renv lockfile works across this version range - minor R version differences are handled automatically
Daily development workflow
# Install new packages as normal
install.packages("packagename")
# After adding dependencies to DESCRIPTION:
devtools::install_dev_deps() # Install updated dependencies
renv::snapshot() # Update lockfile
# Commit the updated renv.lock file
# Check environment status anytime:
renv::status()Building documentation and site
# Generate function documentation
devtools::document()
# Install the package (required before building site)
devtools::install(upgrade = 'never')
# Build pkgdown site (requires Quarto installed)
pkgdown::build_site()
# Run tests
devtools::test()Troubleshooting
# If packages seem out of sync:
renv::status()
# To update package versions:
renv::update()
renv::snapshot()
# To restore to lockfile state:
renv::restore()For more details, see CONTRIBUTING.md.