About this vignette: This reference document provides the complete configuration schema specification. For step-by-step tutorials, see Generating datasets from configuration files.
Overview
MockData uses a two-file configuration system to define mock datasets. This reference documents the complete schema, all special codes, and validation rules.
File structure
mock_data_config.csv
Purpose: Lists which variables to generate
Required columns:
| Column | Type | Description | Example |
|---|---|---|---|
| uid | character | Unique identifier for this variable definition | "age_v1" |
| variable | character | Variable name (appears as column in output) | "age" |
| role | character | Set to "enabled" to generate; date variables use a date role such as "baseline-date" or "index-date" | "enabled" |
| variableType | character | One of: "categorical", "continuous" (dates use "continuous") | "continuous" |
| position | integer | Generation order (1 = first, 2 = second, etc.) | 1 |
Optional columns:
| Column | Type | Description | Example |
|---|---|---|---|
| variableLabel | character | Short human-readable description | "Age in years" |
| variableLabelLong | character | Extended description | "Participant age at baseline interview" |
| variableUnit | character | Unit of measurement | "years" |
| notes | character | Implementation notes | "Rounded to nearest integer" |
Example:
uid,variable,role,variableType,variableLabel,position
age_v1,age,enabled,continuous,Age in years,1
smoking_v1,smoking,enabled,categorical,Smoking status,2
birth_date_v1,birth_date,baseline-date,continuous,Date of birth,3
Note: Dates use variableType = "continuous" with a date-related role for compatibility with recodeflow metadata.
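As a rough illustration of how these columns fit together, the sketch below reads the example config with base R, confirms the required columns are present, and sorts the variables by position. It does not call MockData; the object names (config_text, config, generation_order) are made up for this example.

```r
# Illustrative only: read the example config and derive the generation order.
config_text <- "uid,variable,role,variableType,variableLabel,position
age_v1,age,enabled,continuous,Age in years,1
smoking_v1,smoking,enabled,categorical,Smoking status,2
birth_date_v1,birth_date,baseline-date,continuous,Date of birth,3"

config <- read.csv(text = config_text, stringsAsFactors = FALSE)

# Required columns as listed in the table above
required <- c("uid", "variable", "role", "variableType", "position")
missing_cols <- setdiff(required, names(config))
stopifnot(length(missing_cols) == 0)

# Sort by position to get the order in which variables would be generated
generation_order <- config$variable[order(config$position)]
print(generation_order)
```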
mock_data_config_details.csv
Purpose: Defines categories, ranges, proportions, and data quality patterns
Required columns:
| Column | Type | Description | Example |
|---|---|---|---|
| uid | character | Must match uid in config file | "age_v1" |
| uid_detail | character | Unique identifier for this detail row | "age_v1_d1" |
| variable | character | Must match variable in config file | "age" |
| recStart | character | Input value or range | "[18, 100]" |
| recEnd | character | Output value or special code | "copy" |
Optional but commonly used columns:
| Column | Type | Description | Example |
|---|---|---|---|
| catLabel | character | Short label for this category | "Valid age" |
| catLabelLong | character | Extended category description | "Age in years at baseline" |
| proportion | numeric | Proportion (0-1); must sum to 1.0 per variable, excluding garbage rows | 0.95 |
| rType | character | R data type for output ("integer", "factor", "Date", "double") | "integer" |
| date_start | Date | Start date (for date variables with recEnd = "date_start") | "2001-01-01" |
| date_end | Date | End date (for date variables with recEnd = "date_end") | "2017-03-31" |
Example:
uid,uid_detail,variable,recStart,recEnd,catLabel,catLabelLong,proportion,rType
age_v1,age_v1_d1,age,[18,100],copy,Valid age,Age in years,0.95,integer
age_v1,age_v1_d2,age,999,NA::b,Missing,Not stated,0.05,integer
age_v1,age_v1_d3,age,[200,300],corrupt_high,Invalid age,Data entry error,0.02,integer
smoking_v1,smoking_v1_d1,smoking,1,1,Daily,Daily smoker,0.25,factor
smoking_v1,smoking_v1_d2,smoking,2,2,Occasional,Occasional smoker,0.15,factor
smoking_v1,smoking_v1_d3,smoking,3,3,Never,Never smoked,0.60,factor
Special codes in recEnd
Missing data codes
Note on NA:: codes: These codes (NA::a, NA::b, NA::c) are part of the recodeflow harmonization framework, where they indicate how missing codes should be transformed during harmonization. When generating raw mock data, we use numeric missing codes that match the raw survey data format. However, the NA:: notation is handy in metadata because:
- It documents the meaning of numeric missing codes
- The same metadata can be reused for both mock data generation (outputs numeric codes) and harmonization (converts codes to proper R NA types)
- It maintains consistency between raw data simulation and the harmonization pipeline
| Code | Meaning | Description and typical raw data codes |
|---|---|---|
| NA::a | Not applicable | Variable doesn’t apply to this person (e.g., pregnancy questions for males). CCHS/CHMS often use 996. |
| NA::b | Missing/refused | Participant refused to answer or data is missing. CCHS/CHMS often use 999. |
| NA::c | Don’t know | Participant doesn’t know the answer. CCHS/CHMS often use 998 or 997. |
Important: The specific numeric codes (996, 997, 998, 999) shown in examples are from CCHS/CHMS surveys documented in this package. Your database may use different codes:
- Some surveys use single digits: 7, 8, 9
- Some use three digits: 997, 998, 999
- Some use ranges: [96, 99]
Check your survey’s data dictionary for the actual codes used.
The NA:: notation works with any numeric coding scheme - just specify the appropriate numeric codes in recStart.
Example - generates numeric codes in raw mock data:
variable,recStart,recEnd,catLabel,proportion
alcohol_weekly,[0,100],copy,Drinks per week,0.85
alcohol_weekly,996,NA::a,Not applicable,0.10
alcohol_weekly,999,NA::b,Missing,0.05
This will generate raw data with numeric values 0-100, 996, and 999. During harmonization with recodeflow, the 996 and 999 codes will be converted to proper R NA values based on the NA::a and NA::b specifications.
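To make the mapping concrete, here is a hand-rolled sketch of what the NA::a and NA::b rows imply for a raw column. In practice recodeflow performs this conversion; the vectors raw, harmonized, and missing_tag are invented for the example and are not part of either package.

```r
# Illustrative only: apply the code-to-NA mapping from the example rows above.
raw <- c(3, 12, 996, 0, 999, 45)

# Raw numeric code -> NA:: tag, taken from the alcohol_weekly rows
missing_codes <- c("996" = "NA::a", "999" = "NA::b")

is_missing <- as.character(raw) %in% names(missing_codes)

# Valid values pass through; missing codes become NA
harmonized <- ifelse(is_missing, NA_real_, raw)

# Keep the reason for missingness next to the value
missing_tag <- unname(missing_codes[as.character(raw)])

print(data.frame(raw, harmonized, missing_tag))
```

Keeping the tag in a separate column is only one way to preserve the reason for missingness; it is used here because it needs nothing beyond base R.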
Value transformation codes
| Code | Meaning | Usage |
|---|---|---|
| copy | Pass through unchanged | Use with ranges like [0, 100] - generates values in range |
| date_start | Extract date_start column | For date variables: marks the row containing start date |
| date_end | Extract date_end column | For date variables: marks the row containing end date |
Example:
variable,recStart,recEnd,catLabel,date_start,date_end,proportion
index_date,NA,date_start,Start date,2001-01-01,,1.0
index_date,NA,date_end,End date,,2017-03-31,1.0
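The two rows above only define the bounds of the date range. As a minimal sketch of how dates could be drawn between them, assuming uniform sampling over calendar days (an assumption of this example, not a documented MockData behaviour):

```r
# Illustrative only: draw dates between date_start and date_end.
# Uniform sampling over days is an assumption of this sketch.
set.seed(123)
date_start <- as.Date("2001-01-01")
date_end <- as.Date("2017-03-31")

# Every calendar day in the range, both endpoints included
all_days <- seq(date_start, date_end, by = "day")

index_date <- sample(all_days, size = 5, replace = TRUE)
print(index_date)
```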
Garbage/data quality codes
| Code | Meaning | Usage |
|---|---|---|
| corrupt_low | Below valid range | Generate values lower than expected (e.g., age = -5) |
| corrupt_high | Above valid range | Generate values higher than expected (e.g., age = 250) |
| corrupt_future | Future dates | For date variables: dates after valid range |
| corrupt_past | Past dates | For date variables: dates before valid range |
Example:
variable,recStart,recEnd,catLabel,proportion
age,[18,100],copy,Valid age,1.0
age,[200,300],corrupt_high,Invalid age,0.03
age,[-10,0],corrupt_low,Negative age,0.02
Important: Garbage proportions are SEPARATE from valid+missing proportions. MockData first allocates values to valid vs missing (which must sum to 1.0), then applies garbage to a subset of the valid values.
recStart syntax
Single values
recStart,recEnd,Meaning
1,1,Value 1
2,2,Value 2
999,NA::b,Code 999 becomes missing
Ranges
Ranges use interval notation; a square bracket includes the endpoint and a parenthesis excludes it:
recStart,Meaning
[0,100],Values from 0 to 100 inclusive
[18.5,25),Values from 18.5 (inclusive) to 25 (exclusive)
Range notation:
- [a, b]: Inclusive on both ends (a ≤ x ≤ b)
- [a, b): Inclusive start, exclusive end (a ≤ x < b)
- (a, b]: Exclusive start, inclusive end (a < x ≤ b)
- (a, b): Exclusive on both ends (a < x < b)
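The notation above can be parsed with a single regular expression. The sketch below is one way to do it for numeric ranges; parse_range() and in_range() are hypothetical helpers written for this example, not MockData functions, and date ranges are not handled.

```r
# Illustrative only: parse numeric interval notation such as "[18.5,25)".
# parse_range() and in_range() are hypothetical helpers.
parse_range <- function(x) {
  pattern <- paste0(
    "^[[:space:]]*(\\[|\\()[[:space:]]*(-?[0-9.]+)[[:space:]]*,",
    "[[:space:]]*(-?[0-9.]+)[[:space:]]*(\\]|\\))[[:space:]]*$"
  )
  m <- regmatches(x, regexec(pattern, x))[[1]]
  if (length(m) == 0) stop("Malformed range: ", x)
  list(
    lower = as.numeric(m[3]),
    upper = as.numeric(m[4]),
    lower_inclusive = m[2] == "[",
    upper_inclusive = m[5] == "]"
  )
}

# Test whether values fall inside a parsed range, honouring the bracket types
in_range <- function(value, rng) {
  above <- if (rng$lower_inclusive) value >= rng$lower else value > rng$lower
  below <- if (rng$upper_inclusive) value <= rng$upper else value < rng$upper
  above & below
}

r <- parse_range("[18.5,25)")
print(in_range(c(18.5, 20, 25), r))
```

With the half-open range "[18.5,25)", the lower endpoint 18.5 is accepted and the upper endpoint 25 is rejected.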
For categorical variables: Ranges expand to all integer values
recStart,recEnd,Expands to
[1,5],copy,"1, 2, 3, 4, 5"
For continuous variables: Ranges define sampling bounds
recStart,recEnd,Generates
[0,100],copy,Random values uniformly distributed between 0 and 100
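The difference between the two interpretations can be shown in a few lines of base R. This is only a sketch of the idea; the object names are invented and the code does not reflect how MockData implements either case.

```r
# Illustrative only: the same bracket notation read two ways.
set.seed(42)

# Categorical: the inclusive range [1,5] expands to its integer members
categories <- seq(1, 5)

# Continuous: the range [0,100] gives bounds for uniform sampling
values <- runif(5, min = 0, max = 100)

print(categories)
print(round(values, 1))
```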
Date format
Raw data format: Real survey data (like CCHS/CHMS) stores dates in SAS format (e.g., 01JAN2001). MockData can parse these formats in recStart:
recStart,recEnd,Meaning
[01JAN2001,31MAR2017],copy,Dates between Jan 1 2001 and Mar 31 2017
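For readers who need to parse such strings themselves, a minimal sketch follows. parse_sas_date() is a hypothetical helper, not a MockData function. It matches month abbreviations against R's built-in month.abb constant rather than relying on as.Date() with "%b", whose behaviour depends on the session locale.

```r
# Illustrative only: parse SAS-style date strings such as "01JAN2001".
# parse_sas_date() is a hypothetical helper.
parse_sas_date <- function(x) {
  day <- substr(x, 1, 2)
  mon <- match(toupper(substr(x, 3, 5)), toupper(month.abb))
  year <- substr(x, 6, 9)
  as.Date(sprintf("%s-%02d-%s", year, mon, day))
}

bounds <- parse_sas_date(c("01JAN2001", "31MAR2017"))
print(bounds)
```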
Output format controlled by source_format parameter:
By default, MockData generates dates as R Date objects (analysis-ready format). However, you can simulate different source formats to test harmonization pipelines:
# Default: analysis-ready R Date objects
mock <- create_mock_data(..., source_format = "analysis")
# CSV format: character ISO strings
mock_csv <- create_mock_data(..., source_format = "csv")
# SAS format: numeric (days since 1960-01-01)
mock_sas <- create_mock_data(..., source_format = "sas")
Format options:
- "analysis" (default): R Date objects - ready for analysis
- "csv": Character strings ("2001-01-15") - simulates read.csv() output
- "sas": Numeric values (days since 1960-01-01) - simulates haven::read_sas() output
Use case: Testing harmonization code that needs to parse dates from raw sources:
# Generate CSV-format mock data
mock_csv <- create_mock_data(..., source_format = "csv")
# Test your date parsing logic
library(dplyr)  # provides %>% and mutate()
harmonized <- mock_csv %>%
  mutate(interview_date = as.Date(interview_date, format = "%Y-%m-%d"))
See Date variables and temporal data for detailed examples.
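To see how the three source_format representations relate, the sketch below converts a single date by hand using base R only. It does not call create_mock_data(); the object names are invented for the example.

```r
# Illustrative only: one date in the three representations described above.
d <- as.Date("2001-01-15")                        # "analysis": R Date object

as_csv <- format(d, "%Y-%m-%d")                   # "csv": character ISO string
as_sas <- as.numeric(d - as.Date("1960-01-01"))   # "sas": days since 1960-01-01

# Converting back recovers the original Date in both cases
from_csv <- as.Date(as_csv, format = "%Y-%m-%d")
from_sas <- as.Date(as_sas, origin = "1960-01-01")

print(list(analysis = d, csv = as_csv, sas = as_sas))
```

The origin argument matters for the SAS case: R's own Date class counts days from 1970-01-01, so a SAS day count converted without origin = "1960-01-01" would be off by ten years.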
Alternative approach using date_start/date_end columns:
variable,recStart,recEnd,date_start,date_end,proportion
death_date,NA,date_start,2001-01-01,,1.0
death_date,NA,date_end,,2017-03-31,1.0
Both metadata approaches (SAS format in recStart vs. date_start/date_end columns) work with all source_format options.
Proportions
Basic rules
- Must sum to 1.0 per variable (excluding garbage rows)
- Garbage proportions are separate from valid/missing proportions
- Proportions apply to the population before garbage is applied
Categorical variables
variable,recStart,recEnd,catLabel,proportion
smoking,1,1,Daily,0.25
smoking,2,2,Occasional,0.15
smoking,3,3,Never,0.55
smoking,999,NA::b,Missing,0.05
Sum check: 0.25 + 0.15 + 0.55 + 0.05 = 1.0 ✓
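The same sum check can be written in base R. This is a sketch of the rule, not the package's validator; identifying garbage rows by a recEnd that starts with "corrupt_" is an assumption based on the special codes listed in this reference.

```r
# Illustrative only: check that non-garbage proportions sum to 1 per variable.
details <- read.csv(text = "variable,recStart,recEnd,catLabel,proportion
smoking,1,1,Daily,0.25
smoking,2,2,Occasional,0.15
smoking,3,3,Never,0.55
smoking,999,NA::b,Missing,0.05", stringsAsFactors = FALSE)

# Garbage rows are identified here by a recEnd starting with "corrupt_"
is_garbage <- grepl("^corrupt_", details$recEnd)

sums <- tapply(details$proportion[!is_garbage],
               details$variable[!is_garbage], sum)

# Compare with a tolerance; exact equality is unreliable for floating point
ok <- abs(sums - 1) < 1e-8
print(ok)
```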
Continuous variables
variable,recStart,recEnd,catLabel,proportion
age,[18,100],copy,Valid age,0.95
age,999,NA::b,Missing,0.05
Interpretation:
- 95% of values: sampled uniformly from [18, 100]
- 5% of values: set to 999 in the raw mock data (converted to NA later, during harmonization)
With garbage
variable,recStart,recEnd,catLabel,proportion
age,[18,100],copy,Valid age,0.95
age,999,NA::b,Missing,0.05
age,[200,300],corrupt_high,Invalid,0.02
Process:
- Generate population: 95% valid (18-100), 5% missing (coded 999)
- Apply garbage: Replace 2% of valid values with corrupt_high (200-300)
Result: ~93% valid, ~5% missing, ~2% garbage
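The two-step process can be simulated directly. The sketch below is written from the description above rather than from MockData's source, and it reads "2% of valid values" as 2% of the valid subset, which is an assumption.

```r
# Illustrative only: two-stage allocation (valid/missing first, then garbage).
set.seed(123)
n <- 10000

# Stage 1: valid vs missing (proportions 0.95 and 0.05 sum to 1.0)
is_valid <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.95, 0.05))
age <- ifelse(is_valid, sample(18:100, n, replace = TRUE), 999)

# Stage 2: replace 2% of the valid values with corrupt_high values
valid_idx <- which(is_valid)
n_garbage <- round(0.02 * length(valid_idx))
garbage_idx <- sample(valid_idx, n_garbage)
age[garbage_idx] <- sample(200:300, n_garbage, replace = TRUE)

shares <- c(
  valid = mean(age >= 18 & age <= 100),
  missing = mean(age == 999),
  garbage = mean(age >= 200 & age <= 300)
)
print(round(shares, 3))
```

Because garbage replaces valid values instead of adding rows, the missing share stays near 5% and the valid share drops by roughly the garbage share.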
Validation rules
Configuration file validation
Checked by validate_mock_data_config():
- Required columns present: variable, variableType, variableLabel
- variableType must be one of: "categorical", "continuous", "date"
- No duplicate variable names
- No empty variable names
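For orientation, the checks in this list could be expressed in base R roughly as follows. check_config() is a hypothetical helper written for this example; validate_mock_data_config() may be implemented differently and may report problems in another form.

```r
# Illustrative only: the four checks listed above, written with base R.
# check_config() is a hypothetical helper, not part of MockData.
check_config <- function(config) {
  problems <- character(0)

  required <- c("variable", "variableType", "variableLabel")
  missing_cols <- setdiff(required, names(config))
  if (length(missing_cols) > 0) {
    # Later checks need these columns, so stop here
    return(paste("Missing columns:", paste(missing_cols, collapse = ", ")))
  }

  allowed <- c("categorical", "continuous", "date")
  if (!all(config$variableType %in% allowed)) {
    problems <- c(problems, "Invalid variableType")
  }
  if (anyDuplicated(config$variable) > 0) {
    problems <- c(problems, "Duplicate variable names")
  }
  if (any(is.na(config$variable) | trimws(config$variable) == "")) {
    problems <- c(problems, "Empty variable names")
  }
  problems
}

# A deliberately broken config: duplicate name, bad type, empty name
bad <- data.frame(
  variable = c("age", "age", ""),
  variableType = c("continuous", "number", "categorical"),
  variableLabel = c("Age", "Age again", "Unnamed"),
  stringsAsFactors = FALSE
)
print(check_config(bad))
```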
Details file validation
Checked by validate_mock_data_config_details():
- Required columns present: variable, recStart, recEnd, catLabel, proportion
- All variables in details exist in config file
- Proportions are numeric and between 0 and 1
- Proportions sum to 1.0 per variable (excluding garbage rows)
- No duplicate recStart values per variable
- Range notation is well-formed
- Special codes are valid
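Two of these checks, valid special codes and duplicate recStart values, are sketched below in base R. The list of accepted codes is assembled from the tables in this reference, and treating any plain number as a valid recEnd is an assumption of the example. The real validate_mock_data_config_details() may apply different rules.

```r
# Illustrative only: check special codes and duplicate recStart values.
details <- data.frame(
  variable = c("age", "age", "age", "age"),
  recStart = c("[18,100]", "999", "999", "[200,300]"),
  recEnd = c("copy", "NA::b", "NA::d", "corrupt_high"),
  stringsAsFactors = FALSE
)

# Special codes documented in this reference
special_codes <- c(
  "copy", "date_start", "date_end",
  "NA::a", "NA::b", "NA::c",
  "corrupt_low", "corrupt_high", "corrupt_future", "corrupt_past"
)

# A recEnd is accepted here if it is a special code or a plain number
is_number <- grepl("^-?[0-9]+(\\.[0-9]+)?$", details$recEnd)
bad_codes <- details$recEnd[!(is_number | details$recEnd %in% special_codes)]

# Rows that repeat an earlier (variable, recStart) combination
dup_rows <- duplicated(details[, c("variable", "recStart")])

print(bad_codes)
print(details$recStart[dup_rows])
```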
Cross-file validation
Checked during generation:
- Every variable in config has at least one row in details
- Variable types match usage (e.g., date variables use date_start/date_end)
- Garbage rows have valid proportion values
- Date ranges are valid dates
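The first cross-file check amounts to a set comparison between the two files. A minimal sketch, using made-up variable lists in place of the real CSV columns:

```r
# Illustrative only: compare the variables named in each file.
config_vars <- c("age", "smoking", "index_date")
details_vars <- c("age", "age", "smoking", "smoking", "bmi")

# Variables in config with no rows in details
without_details <- setdiff(config_vars, details_vars)

# Variables in details that the config does not define
unknown_in_details <- setdiff(details_vars, config_vars)

print(without_details)
print(unknown_in_details)
```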
Common patterns
Pattern 1: Simple categorical
# Config
uid,variable,role,variableType,variableLabel,position
sex_v1,sex,enabled,categorical,Biological sex,1
# Details
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion,rType
sex_v1,sex_v1_d1,sex,1,1,Male,0.48,factor
sex_v1,sex_v1_d2,sex,2,2,Female,0.52,factor
Pattern 2: Continuous with missing
# Config
uid,variable,role,variableType,variableLabel,variableUnit,position
bmi_v1,bmi,enabled,continuous,Body mass index,kg/m²,1
# Details
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion,rType
bmi_v1,bmi_v1_d1,bmi,[15,50],copy,Valid BMI,0.95,double
bmi_v1,bmi_v1_d2,bmi,999,NA::b,Missing,0.05,double
Pattern 3: Date variable with range
# Config
uid,variable,role,variableType,variableLabel,position
index_date_v1,index_date,index-date,continuous,Cohort entry date,1
# Details
uid,uid_detail,variable,recStart,recEnd,catLabel,date_start,date_end,proportion,rType
index_date_v1,index_date_v1_d1,index_date,NA,date_start,Start,2001-01-01,,1.0,Date
index_date_v1,index_date_v1_d2,index_date,NA,date_end,End,,2017-03-31,1.0,Date
Pattern 4: With data quality issues
# Config
uid,variable,role,variableType,variableLabel,position
age_v1,age,enabled,continuous,Age in years,1
# Details
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion,rType
age_v1,age_v1_d1,age,[18,100],copy,Valid age,0.95,integer
age_v1,age_v1_d2,age,999,NA::b,Missing,0.05,integer
age_v1,age_v1_d3,age,[150,250],corrupt_high,Too high,0.01,integer
age_v1,age_v1_d4,age,[-5,0],corrupt_low,Negative,0.01,integer
Complete example
Here’s a complete two-file configuration for a simple cohort study:
mock_data_config.csv:
uid,variable,role,variableType,variableLabel,variableUnit,position
person_id_v1,person_id,enabled,continuous,Person ID,,1
age_v1,age,enabled,continuous,Age at baseline,years,2
sex_v1,sex,enabled,categorical,Biological sex,,3
smoking_v1,smoking,enabled,categorical,Smoking status,,4
index_date_v1,index_date,index-date,continuous,Cohort entry date,,5
death_date_v1,death_date,outcome-date,continuous,Death date,,6
mock_data_config_details.csv:
uid,uid_detail,variable,recStart,recEnd,catLabel,date_start,date_end,proportion,rType
person_id_v1,person_id_v1_d1,person_id,[1,100000],copy,Person ID,,,1.0,integer
age_v1,age_v1_d1,age,[18,100],copy,Valid age,,,0.95,integer
age_v1,age_v1_d2,age,999,NA::b,Missing,,,0.05,integer
age_v1,age_v1_d3,age,[150,200],corrupt_high,Invalid age,,,0.02,integer
sex_v1,sex_v1_d1,sex,1,1,Male,,,0.48,factor
sex_v1,sex_v1_d2,sex,2,2,Female,,,0.52,factor
smoking_v1,smoking_v1_d1,smoking,1,1,Daily,,,0.25,factor
smoking_v1,smoking_v1_d2,smoking,2,2,Occasional,,,0.15,factor
smoking_v1,smoking_v1_d3,smoking,3,3,Never,,,0.55,factor
smoking_v1,smoking_v1_d4,smoking,999,NA::b,Missing,,,0.05,factor
index_date_v1,index_date_v1_d1,index_date,NA,date_start,Start,2001-01-01,,1.0,Date
index_date_v1,index_date_v1_d2,index_date,NA,date_end,End,,2017-03-31,1.0,Date
death_date_v1,death_date_v1_d1,death_date,NA,date_start,Start,2001-01-01,,1.0,Date
death_date_v1,death_date_v1_d2,death_date,NA,date_end,End,,2025-12-31,1.0,Date
Generate the data:
library(MockData)
mock_data <- create_mock_data(
  config_path = "mock_data_config.csv",
  details_path = "mock_data_config_details.csv",
  n = 1000,
  seed = 123
)
See also
- Getting started - Tutorial for creating your first mock dataset
- User guide - Complete feature documentation
- Advanced topics - Technical details and edge cases