About this vignette: This reference document provides the complete configuration schema specification for MockData v0.2.1. For step-by-step tutorials, see Generating datasets from configuration files.
Finding specific columns
This is a comprehensive reference with 20+ configuration columns. Use your browser’s search (Ctrl+F / Cmd+F) to quickly find specific columns by name. Jump to sections: variables.csv | variable_details.csv | Examples
Quick reference
Essential columns for getting started:
| Column | File | Required | Purpose | Example |
|---|---|---|---|---|
| uid | variables.csv | Yes | Unique variable identifier | "v001" |
| variable | variables.csv | Yes | Output column name | "age" |
| variableType | variables.csv | Yes | Categorical or Continuous (from recodeflow) | "Continuous" |
| rType | variables.csv | No | R output type (integer, double, character, date, factor) | "integer" |
| distribution | variables.csv | Continuous/Date | normal, uniform, gompertz | "normal" |
| mean, sd | variables.csv | Normal dist | Distribution parameters | 50, 15 |
| garbage_low_prop | variables.csv | No | QA testing (low values) | 0.01 |
| garbage_high_prop | variables.csv | No | QA testing (high values) | 0.03 |
| uid_detail | variable_details.csv | Yes | Unique detail identifier | "d001" |
| recStart | variable_details.csv | Yes | Category code or range | "1" or "[18,100]" |
| recEnd | variable_details.csv | Conditional | Missing data classification | "1", "NA::a", "NA::b" |
| catLabel | variable_details.csv | No | Category label | "Never smoker" |
| proportion | variable_details.csv | Categorical | Category probability (0-1) | 0.5 |
For complete column documentation, see sections below.
How this vignette is generated
This vignette uses inline R code to generate documentation directly from the actual example data. Column counts, CSV examples, and dataset summaries are calculated dynamically from that data.
This approach ensures the documentation stays synchronized with the package and serves as integration testing during package builds.
Overview
MockData uses a two-file configuration system to define mock datasets:
- variables.csv - Variable-level metadata and generation parameters (24 columns total: 4 core + 20 extensions)
- variable_details.csv - Detail-level specifications for categories and ranges (7 columns: 4 core + 3 extensions)
This reference documents the complete v0.2.1 schema, including all extension columns, interval notation, and validation rules.
File: variables.csv
Purpose: Variable-level metadata and generation parameters
Core columns (from recodeflow)
| Column | Type | Required | Description | Example |
|---|---|---|---|---|
| uid | character | Yes | Unique identifier for this variable | "v001" |
| variable | character | Yes | Variable name (column in output) | "age" |
| label | character | No | Human-readable description | "Age in years" |
| variableType | character | Yes | From recodeflow: "Categorical" or "Continuous" | "Continuous" |
UID format: Use pattern vNNN with zero-padded numbers (e.g., v001, v002, v010)
Extension columns (MockData-specific)
Type and generation control
| Column | Type | Description | Values | Example |
|---|---|---|---|---|
| rType | character | R data type for output | "integer", "double", "character", "date", "factor" | "integer" |
| role | character | Multi-valued roles (comma-separated) | "enabled,predictor,table1" | "enabled,predictor" |
| position | integer | Generation order (use increments of 10) | 10, 20, 30 | 10 |
| seed | integer | Random seed for this variable | Any integer | 100 |
Role values:
- enabled - Generate this variable (required for generation)
- predictor - Use in regression models
- outcome - Outcome variable
- metadata - Metadata/administrative variable
- table1 - Include in Table 1 summary
Seed pattern (recommended): seed = position × 10. Giving each variable its own seed keeps generation reproducible and prevents correlation artifacts.
Garbage data (data quality testing)
| Column | Type | Description | Values | Example |
|---|---|---|---|---|
| garbage_low_prop | numeric | Proportion of low-range garbage data | 0 to 1 | 0.01 |
| garbage_low_range | character | Range for low garbage data | Interval notation | "[-5,10]" |
| garbage_high_prop | numeric | Proportion of high-range garbage data | 0 to 1 | 0.03 |
| garbage_high_range | character | Range for high garbage data | Interval notation | "[120,150]" |
| prop_garbage | numeric | Proportion of auto-generated invalid values | 0 to 1 | 0.05 |
Two garbage data modes:
- Advanced (precise control): Use garbage_low_prop/garbage_low_range and/or garbage_high_prop/garbage_high_range to specify exact ranges
- Simple (auto-generated): Use prop_garbage for automatic invalid value generation
Precedence: If garbage_low_prop OR garbage_high_prop specified, those take precedence. Otherwise, prop_garbage is used.
Interpretation by variable type:
- Categorical: prop_garbage generates invalid codes (99, 999, 88, etc. not in valid categories)
- Continuous: Advanced uses specified ranges; simple generates out-of-range values
- Date: Advanced uses specified date ranges; simple generates dates 1-5 years before/after valid range
Examples:
# Advanced: Age with precise low garbage data (negative ages)
uid,variable,garbage_low_prop,garbage_low_range,garbage_high_prop,garbage_high_range,prop_garbage
v001,age,0.01,"[-5,10]",NA,"[,]",NA
# Advanced: BMI with two-sided garbage data (2% low + 1% high)
uid,variable,garbage_low_prop,garbage_low_range,garbage_high_prop,garbage_high_range,prop_garbage
v003,BMI,0.02,"[-10,0]",0.01,"[60,150]",NA
# Simple: Smoking with auto-generated invalid codes
uid,variable,garbage_low_prop,garbage_low_range,garbage_high_prop,garbage_high_range,prop_garbage
v002,smoking,NA,NA,NA,NA,0.05
# Simple: Death date with auto-generated out-of-period dates
uid,variable,garbage_low_prop,garbage_low_range,garbage_high_prop,garbage_high_range,prop_garbage
v006,death_date,NA,"[,]",NA,"[,]",0.02
Sentinel values: Use NA for not applicable, "[,]" for empty ranges.
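The precedence rule can be restated as a small base R sketch. The helper name garbage_mode is hypothetical and not part of MockData, and treating a proportion of 0 the same as NA ("not specified") is an assumption made here for illustration.

```r
# Hypothetical helper (not a MockData function): decide which garbage mode
# applies to a single variables.csv row, following the documented precedence.
# Assumption: a proportion of NA or 0 counts as "not specified".
garbage_mode <- function(garbage_low_prop, garbage_high_prop, prop_garbage) {
  is_set <- function(x) !is.na(x) && x > 0
  if (is_set(garbage_low_prop) || is_set(garbage_high_prop)) {
    "advanced"   # explicit low/high ranges take precedence
  } else if (is_set(prop_garbage)) {
    "simple"     # fall back to auto-generated invalid values
  } else {
    "none"       # no garbage data requested
  }
}

mode_age     <- garbage_mode(0.01, NA, NA)      # age row from the examples
mode_smoking <- garbage_mode(NA, NA, 0.05)      # smoking row from the examples
mode_both    <- garbage_mode(0.02, 0.01, 0.05)  # advanced wins over prop_garbage
```

With the example rows above, age resolves to the advanced mode and smoking to the simple mode; a row that sets both kinds of column still resolves to advanced.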
Distribution parameters (continuous and date variables)
| Column | Type | Description | Values | Example |
|---|---|---|---|---|
| distribution | character | Distribution type | "normal", "uniform", "gompertz", "exponential" | "normal" |
| mean | numeric | Mean (normal distribution) | Any number | 50 |
| sd | numeric | Standard deviation (normal) | Positive number | 15 |
| rate | numeric | Rate parameter (gompertz/exponential) | Positive number | 0.0001 |
| shape | numeric | Shape parameter (gompertz) | Any number | 0.1 |
Distribution types:
For continuous variables:
- "normal" - Normal (Gaussian) distribution (requires mean, sd)
- "uniform" - Uniform distribution over valid range
For date variables:
- "uniform" - Equal probability for all dates
- "gompertz" - Age-related hazard (requires rate, shape, followup_min, followup_max, event_prop)
- "exponential" - Constant hazard (requires rate, followup_min, followup_max, event_prop)
For categorical variables: Set to NA (categories defined by proportions in variable_details.csv)
Examples:
# Age: normal distribution
uid,variable,distribution,mean,sd,rate,shape
v001,age,normal,50,15,NA,NA
# BMI: normal distribution
uid,variable,distribution,mean,sd,rate,shape
v003,BMI,normal,27.5,5.2,NA,NA
# Interview date: uniform
uid,variable,distribution,mean,sd,rate,shape
v004,interview_date,uniform,NA,NA,NA,NA
# Primary event: gompertz survival
uid,variable,distribution,mean,sd,rate,shape
v005,primary_event_date,gompertz,NA,NA,0.0001,0.1
Survival parameters (date variables with events)
| Column | Type | Description | Values | Example |
|---|---|---|---|---|
| followup_min | integer | Minimum follow-up days | Positive integer | 365 |
| followup_max | integer | Maximum follow-up days | Positive integer | 5475 |
| event_prop | numeric | Proportion experiencing event | 0 to 1 | 0.1 |
When to use:
- Date variables representing events (death_date, disease_diagnosis, etc.)
- NOT for index dates (interview_date) - those are the time origin
Example:
# Primary event date: 10% experience dementia diagnosis within 1-15 years
uid,variable,distribution,followup_min,followup_max,event_prop
v005,primary_event_date,gompertz,365,5475,0.1
# Death date: 20% die within 1-20 years (competing risk)
uid,variable,distribution,followup_min,followup_max,event_prop
v006,death_date,gompertz,365,7300,0.2
# Loss to follow-up: 10% lost within 1-20 years (censoring)
uid,variable,distribution,followup_min,followup_max,event_prop
v007,ltfu_date,uniform,365,7300,0.1
Versioning
| Column | Type | Description | Format | Example |
|---|---|---|---|---|
| mockDataVersion | character | Semantic version | MAJOR.MINOR.PATCH | "1.0.0" |
| mockDataLastUpdated | character | Last update date | YYYY-MM-DD | "2025-11-09" |
| mockDataVersionNotes | character | Version notes | Free text | "Initial version" |
Use cases:
- Track changes to generation parameters over time
- Document why garbage data values changed
- Maintain audit trail for reproducibility
Complete example
"uid","variable","label","variableType","rType","role","position","seed","garbage_low_prop","garbage_low_range","garbage_high_prop","garbage_high_range","distribution","mean","sd","rate","shape","followup_min","followup_max","event_prop","sourceFormat","mockDataVersion","mockDataLastUpdated","mockDataVersionNotes"
"cchsflow_v0001","age","Age in years","Continuous","integer","enabled,predictor,table1",10,10,NA,"[;]",NA,"[;]","normal",50,15,NA,NA,NA,NA,NA,"","1.0.0","2025-11-09","Normal distribution (mean=50, sd=15)"
"cchsflow_v0002","smoking","Smoking status","Categorical","factor","enabled,predictor,table1",20,20,NA,"",NA,"","",NA,NA,NA,NA,NA,NA,NA,"","1.0.0","2025-11-09","Categorical variable with proportions"
"cchsflow_v0003","BMI","Body mass index","Continuous","double","enabled,outcome,table1",30,30,0.02,"[-10;15])",0.01,"[60;150]","normal",27.5,5.2,NA,NA,NA,NA,NA,"","1.0.0","2025-11-09","Normal distribution with two-sided contamination"
"cchsflow_v0004","height","Height in meters","Continuous","double","enabled,predictor",40,40,1,"[0;1.4)",0.01,"(2.1;inf]","normal",1.7,0.1,NA,NA,NA,NA,NA,"","1.0.0","2025-11-13","Height for BMI calculation"
"cchsflow_v0005","weight","Weight in kilograms","Continuous","double","enabled,predictor",50,50,NA,"[;]",NA,"[;]","normal",75,15,NA,NA,NA,NA,NA,"","1.0.0","2025-11-13","Weight for BMI calculation"
"cchsflow_v0006","BMI_derived","BMI calculated from height and weight","Continuous","double","enabled,outcome,table1",60,60,NA,"[;]",NA,"[;]","",NA,NA,NA,NA,NA,NA,NA,"","1.0.0","2025-11-13","Derived variable: BMI = weight / (height^2)"
"ices_v01","interview_date","Interview date (cohort entry)","Continuous","date","enabled,outcome",70,70,0,"[;]",0,"[;]","uniform",NA,NA,NA,NA,NA,NA,NA,"analysis","1.0.0","2025-11-09","Uniform date range (cohort entry)"
"ices_v02","primary_event_date","Primary event date (dementia diagnosis)","Continuous","date","enabled,outcome,table1",80,80,0,"[;]",0.03,"[2021-01-01;2099-12-31]","gompertz",NA,NA,1e-04,0.1,0,5475,0.3,"analysis","1.0.0","2025-11-09","Gompertz survival with temporal violations (3%)"
"ices_v03","death_date","Death date (competing risk)","Continuous","date","enabled,outcome, table1",90,90,0,"[;]",0.03,"[2025-01-01;2099-12-31]","gompertz",NA,NA,1e-04,0.1,365,7300,0.2,"analysis","1.0.0","2025-11-09","Gompertz survival with auto-generated date corruption"
"ices_v04","ltfu_date","Loss to follow-up date","Continuous","date","enabled,outcome",100,100,0,"[;]",0.03,"[2025-01-01;2099-12-31]","uniform",NA,NA,NA,NA,365,7300,0.1,"analysis","1.0.0","2025-11-09","Uniform censoring (10% occurrence)"
"ices_v05","admin_censor_date","admin_censor_date","Continuous","date","enabled,outcome",110,110,0,"[;]",0,"[;]","",NA,NA,NA,NA,365,7300,1,"analysis","1.0.0","2025-11-09",""
File: variable_details.csv
Purpose: Detail-level specifications for categories and ranges
Core columns (from recodeflow)
| Column | Type | Required | Description | Example |
|---|---|---|---|---|
| uid | character | Yes | Foreign key to variables.csv | "v001" |
| uid_detail | character | Yes | Unique identifier for this row | "d001" |
| variable | character | Yes | Must match variable in variables.csv | "age" |
| recStart | character | Yes | Input value or range | "[18,100]" or "1" |
UID relationships:
- uid must exist in variables.csv (foreign key)
- uid_detail must be unique across entire file
- Pattern: dNNN with zero-padded numbers
Extension columns (MockData-specific)
| Column | Type | Description | Example |
|---|---|---|---|
| catLabel | character | Category label or description | "Valid age range" or "Never smoker" |
| proportion | numeric | Population proportion (0-1) | 0.5 |
Proportion rules:
- Must sum to 1.0 per variable (for categorical variables)
- Use NA for continuous/date variables with single range specification
Complete example
"uid","uid_detail","variable","recStart","recEnd","catLabel","proportion"
"cchsflow_v0001","cchsflow_d00001","age","[18,100]","copy","Valid age range",0.9
"cchsflow_v0001","cchsflow_d00002","age","997","NA::b","Don't know",0.05
"cchsflow_v0001","cchsflow_d00003","age","998","NA::b","Refusal",0.03
"cchsflow_v0001","cchsflow_d00004","age","999","NA::b","Not stated",0.02
"cchsflow_v0002","cchsflow_d00005","smoking","1","1","Never smoker",0.5
"cchsflow_v0002","cchsflow_d00006","smoking","2","2","Former smoker",0.3
"cchsflow_v0002","cchsflow_d00007","smoking","3","3","Current smoker",0.17
"cchsflow_v0002","cchsflow_d00008","smoking","7","NA::b","Don't know",0.03
"cchsflow_v0003","cchsflow_d00009","BMI","[15,50]","copy","Valid BMI range",NA
"cchsflow_v0003","cchsflow_d00010","BMI","996","NA::a","Not applicable",0.3
"cchsflow_v0003","cchsflow_d00011","BMI","[997,999]","NA::b","Don't know, refusal, not stated",0.1
"cchsflow_v0004","cchsflow_d00012","height","[1.4,2.1]","copy","Valid height range (meters)",NA
"cchsflow_v0004","cchsflow_d00013","height","else","NA::b","Missing height",0.02
"cchsflow_v0005","cchsflow_d00014","weight","[35,150]","copy","Valid weight range (kg)",NA
"cchsflow_v0005","cchsflow_d00015","weight","else","NA::b","Missing weight",0.03
"cchsflow_v0006","cchsflow_d00016","BMI_derived","DerivedVar::[height, weight]","Func::bmi_fun","BMI calculated from height and weight",NA
"ices_v01","ices_d001","interview_date","[2001-01-01,2005-12-31]","copy","Interview date range",1
"ices_v01","ices_d002","interview_date","else","NA::b","Missing interview date",0
"ices_v02","ices_d003","primary_event_date","[2002-01-01,2021-01-01]","copy","Primary event date range",0.1
"ices_v02","ices_d004","primary_event_date","else","NA::b","Missing event date",0
"ices_v03","ices_d005","death_date","[2002-01-01,2024-12-31]","copy","Death date range",0.2
"ices_v03","ices_d006","death_date","else","NA::b","Missing death date",0.05
"ices_v04","ices_d007","ltfu_date","[2002-01-01,2024-12-31]","copy","Loss to follow-up date range",0.05
"ices_v04","ices_d008","ltfu_date","else","NA::b","Missing ltfu date",NA
"ices_v05","ices_d009","admin_censor_date","2024-12-31","copy","Administrative censor date",1
"ices_v05","ices_d010","admin_censor_date","else","NA::b","Missing administrative censor date",0
Note: Distribution parameters (mean, sd, rate, shape, etc.) are specified in variables.csv, NOT in variable_details.csv.
recStart syntax
The recStart column specifies input values or ranges using either single values or interval notation.
Single values (categorical variables)
For categorical variables, specify exact category codes:
uid,recStart,catLabel,proportion
v002,1,Never smoker,0.50
v002,2,Former smoker,0.30
v002,3,Current smoker,0.20
Interval notation (continuous and date variables)
For continuous or date variables, use interval notation.
Format: [min,max] with comma delimiter
Common bracket types:
- [a,b] - Inclusive on both ends (most common)
- [a,b) - Inclusive start, exclusive end
Examples:
# Numeric range: age 18 to 100 (inclusive)
uid,recStart
v001,"[18,100]"
# Date range: interview dates 2001-2005
uid,recStart
v004,"[2001-01-01,2005-12-31]"
# Valid range for distribution truncation
uid,recStart
v003,"[18,40]"
Important: Always wrap interval notation in double quotes in CSV files so that the comma inside the brackets is not treated as a column delimiter.
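To make the notation concrete, here is a minimal base R sketch that reads a quoted interval from CSV text and splits it into bounds. The function parse_interval is a hypothetical helper written for this illustration; MockData's own parser may behave differently, and bounds are returned as strings so that both numbers and dates can be handled by the caller.

```r
# Hypothetical parser for comma-delimited interval notation, e.g. "[18,100]"
# or "[18,40)". Not a MockData function.
parse_interval <- function(x) {
  pattern <- "^(\\[|\\()([^,]*),([^,]*)(\\]|\\))$"
  m <- regmatches(x, regexec(pattern, x))[[1]]
  if (length(m) == 0) stop("Not interval notation: ", x)
  list(
    min = trimws(m[3]),
    max = trimws(m[4]),
    min_inclusive = m[2] == "[",   # "[" is inclusive, "(" is exclusive
    max_inclusive = m[5] == "]"    # "]" is inclusive, ")" is exclusive
  )
}

# The double quotes keep the comma inside the brackets in a single field.
csv_text <- 'uid,recStart\nv001,"[18,100]"\nv004,"[2001-01-01,2005-12-31]"'
details <- read.csv(text = csv_text, stringsAsFactors = FALSE)

age_range  <- parse_interval(details$recStart[1])
date_range <- parse_interval(details$recStart[2])

as.numeric(age_range$max)   # numeric upper bound
as.Date(date_range$min)     # date lower bound
```

Without the quotes, read.csv would see three fields on each data row instead of two.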
recEnd for missing data classification
The recEnd column classifies codes into missing vs. valid categories, enabling automatic missing data generation.
Purpose: Distinguishes between:
- Valid response codes (1, 2, 3)
- Skip codes (6, 96, 996 - question not applicable)
- Missing codes (7-9, 97-99 - don’t know, refusal, not stated)
Conditional requirement: Required when recStart contains missing data codes (6-9, 96-99) to enable proper classification.
Valid response codes
Map input codes to themselves using numeric values:
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v002,d002,smoking,1,1,Never smoker,0.50
v002,d003,smoking,2,2,Former smoker,0.30
v002,d004,smoking,3,3,Current smoker,0.17
Pattern: recStart="1" → recEnd="1" (code maps to itself)
Skip codes: NA::a
For valid skip/not applicable codes (question not asked due to logic):
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v002,d005,smoking,6,NA::a,Valid skip,0.01
Statistical treatment: Exclude from denominator (respondent was not eligible for question)
Common codes: 6, 96, 996 (varies by survey)
Missing codes: NA::b
For don’t know, refusal, not stated (question asked but no valid response):
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v002,d006,smoking,7,NA::b,Don't know,0.02
v002,d007,smoking,9,NA::b,Not stated,0.01
Statistical treatment: Include in denominator when calculating response rates, exclude from numerator
Common codes: 7 (don’t know), 8 (refusal), 9 (not stated), 97, 98, 99
Range notation: Can use [7,9] → NA::b to map multiple codes at once
Continuous and date variables
Use "copy" for valid ranges:
uid,uid_detail,variable,recStart,recEnd,catLabel
v001,d001,age,"[18,100]",copy,Valid age range
v004,d007,interview_date,"[2001-01-01,2005-12-31]",copy,Interview date range
Pattern: recEnd="copy" indicates the range should be used as-is for generation
Complete example with missing data
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v002,d002,smoking,1,1,Daily smoker,0.25
v002,d003,smoking,2,2,Occasional,0.15
v002,d004,smoking,3,3,Never,0.57
v002,d005,smoking,7,NA::b,Don't know,0.03
# Sum: 1.00 ✓
Result: When generating data, get_variable_categories(include_na=TRUE) returns only code "7", while get_variable_categories(include_na=FALSE) returns codes "1", "2", "3".
Why recEnd is required
Without recEnd: Cannot distinguish between:
- Code 1 (valid response)
- Code 7 (missing - don’t know)
With recEnd: Explicit classification enables:
- Automatic prop_NA parameter handling
- Correct missing vs. valid proportions
- Statistical analysis (response rates, prevalence)
Validation: If recStart contains codes 6-9 or 96-99 and recEnd column is missing, validation will error with instructions.
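The classification that recEnd enables can be sketched in base R as follows. The helper classify_rec_end is hypothetical (it is not the package's get_variable_categories), and it assumes that any recEnd value other than NA::a or NA::b marks a valid response.

```r
# Hypothetical helper: label each detail row as valid, skip, or missing
# based only on recEnd. Assumption: anything other than NA::a / NA::b is valid.
classify_rec_end <- function(rec_end) {
  ifelse(rec_end == "NA::a", "skip",
         ifelse(rec_end == "NA::b", "missing", "valid"))
}

details <- data.frame(
  recStart = c("1", "2", "3", "6", "7"),
  recEnd   = c("1", "2", "3", "NA::a", "NA::b"),
  stringsAsFactors = FALSE
)
details$class <- classify_rec_end(details$recEnd)

# Codes grouped by classification
codes_by_class <- split(details$recStart, details$class)
```

Looking only at recStart, codes 1 and 7 are indistinguishable; the recEnd column is what separates them.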
Proportions
Basic rules
- Must sum to 1.0 per variable (for categorical distributions)
- Garbage data is separate - garbage_*_prop and prop_garbage not included in proportion sum
- Event proportions - event_prop in variables.csv may be < 1.0 (represents censoring)
Categorical variables
Proportions define population distribution:
uid,recStart,catLabel,proportion
v002,1,Never,0.50
v002,2,Former,0.30
v002,3,Current,0.17
v002,7,Missing,0.03
# Sum: 1.00 ✓
Continuous and date variables
Use NA for proportion:
uid,recStart,catLabel,proportion
v001,"[18,100]","Valid age range",NA
v004,"[2001-01-01,2005-12-31]","Interview date range",NA
With missing codes (categorical)
Missing codes are part of population (must sum to 1.0):
uid,recStart,catLabel,proportion
v002,1,Never smoker,0.50
v002,2,Former smoker,0.30
v002,3,Current smoker,0.17
v002,7,Don't know,0.03
# Sum: 1.00 ✓
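A quick way to check the sum rule before loading a configuration is sketched below in base R, using the ±0.01 tolerance listed under the validation rules. This is not MockData's validator: the data frame is a hand-built stand-in for variable_details.csv, v009 is a hypothetical variable that is deliberately incomplete, and skipping rows with NA proportions is an assumption.

```r
# Stand-in for variable_details.csv; v009 is hypothetical and sums to 0.80.
details <- data.frame(
  uid        = c("v002", "v002", "v002", "v002", "v009", "v009", "v001"),
  recStart   = c("1", "2", "3", "7", "1", "2", "[18,100]"),
  proportion = c(0.50, 0.30, 0.17, 0.03, 0.50, 0.30, NA),
  stringsAsFactors = FALSE
)

# Assumption: rows with NA proportion (continuous/date ranges) are ignored.
with_prop <- details[!is.na(details$proportion), ]

# Sum proportions per variable and flag sums outside 1.0 +/- 0.01.
sums <- tapply(with_prop$proportion, with_prop$uid, sum)
bad  <- names(sums)[abs(sums - 1) > 0.01]
bad
```

Here v002 passes and v009 is flagged.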
Validation rules
MockData validates configuration files on load:
UID validation
- Pattern: uid should follow format vNNN (e.g., v001, v042), uid_detail should follow dNNN (e.g., d001, d156)
- Uniqueness: All uid values unique in variables.csv
- Uniqueness: All uid_detail values unique in variable_details.csv
- Foreign keys: All uid in variable_details exist in variables.csv
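The uniqueness and foreign-key rules can be checked by hand with a few lines of base R. The function check_uids below is a hypothetical illustration, not the package's validator, and the two data frames are minimal stand-ins for the CSV files.

```r
# Hypothetical check of the uniqueness and foreign-key rules listed above.
check_uids <- function(variables, details) {
  problems <- character(0)
  if (anyDuplicated(variables$uid) > 0) {
    problems <- c(problems, "duplicate uid in variables.csv")
  }
  if (anyDuplicated(details$uid_detail) > 0) {
    problems <- c(problems, "duplicate uid_detail in variable_details.csv")
  }
  orphans <- setdiff(details$uid, variables$uid)
  if (length(orphans) > 0) {
    problems <- c(problems, paste("uid missing from variables.csv:",
                                  paste(orphans, collapse = ", ")))
  }
  problems
}

variables <- data.frame(uid = c("v001", "v002"),
                        variable = c("age", "smoking"),
                        stringsAsFactors = FALSE)
details_ok <- data.frame(uid = c("v001", "v002", "v002"),
                         uid_detail = c("d001", "d002", "d003"),
                         stringsAsFactors = FALSE)
details_bad <- data.frame(uid = c("v001", "v003"),
                          uid_detail = c("d001", "d001"),
                          stringsAsFactors = FALSE)

check_uids(variables, details_ok)    # no problems
check_uids(variables, details_bad)   # duplicate uid_detail, orphan uid v003
```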
Column validation
- Required columns present:
  - variables.csv: uid, variable, variableType
  - variable_details.csv: uid, uid_detail, variable, recStart
- Variable name match: variable_details.variable matches variables.variable
- Column placement: Extension columns only in correct files (e.g., distribution params in variables.csv)
Value validation
- Proportions: Sum to 1.0 per variable (±0.01 tolerance)
- Garbage data proportions: Between 0 and 1
- Garbage data ranges: Use interval notation [min,max] or sentinel "[,]"
- rType values: One of: integer, double, factor, date, character, logical
- distribution values: One of: normal, uniform, gompertz, exponential, or NA
- Versioning: mockDataVersion follows semantic versioning (if present)
- Complete grid: All cells have explicit values (no empty strings for optional columns - use NA or sentinel values)
Detailed column reference
uid and uid_detail: Unique identifiers
Purpose: Permanent identifiers for traceability across metadata versions
Format requirements:
- uid: Pattern vNNN with zero-padding (e.g., v001, v042)
- uid_detail: Pattern dNNN with zero-padding (e.g., d001, d156)
- Zero-padding recommended for sorting: v001 not v1
Common errors:
# ❌ WRONG: Inconsistent padding
uid,variable
v1,age
v10,smoking
v2,BMI
# Result: Sorts incorrectly (v1, v10, v2)
# ✅ CORRECT: Zero-padded
uid,variable
v001,age
v002,smoking
v010,BMI
# Result: Sorts correctly (v001, v002, v010)
Edge cases:
- UIDs must be globally unique across all databases
- Reusing UIDs across databases for “same variable” is acceptable but requires consistent definitions
- Changing UID means “new variable” even if variable name unchanged
Best practices:
- Start numbering at v001 (not v000)
- Use sequential numbers for easy tracking
- Document UID assignment logic in project README
- Use project prefixes for multi-project repos: cchs_v001, chms_v001
role: Multi-valued variable roles
Purpose: Tag variables for different purposes in analysis
Valid values (comma-separated):
| Value | Meaning | Use case |
|---|---|---|
| enabled | Required for generation | Must be present to generate variable |
| predictor | Independent/explanatory variable | Regression models, Table 1 |
| outcome | Dependent/response variable | Primary/secondary outcomes |
| metadata | Administrative/tracking | IDs, dates, survey metadata |
| table1 | Summary table variable | Baseline characteristics |
Examples:
# Age: predictor for models, show in Table 1
uid,variable,role
v001,age,"enabled,predictor,table1"
# Primary outcome: show in Table 1
uid,variable,role
v005,dementia_diagnosis,"enabled,outcome,table1"
# Study ID: just metadata
uid,variable,role
v020,study_id,"enabled,metadata"
# Derived variable: not enabled (calculated post-generation)
uid,variable,role
v006,BMI_derived,outcome
Important: Variable will NOT be generated unless role contains "enabled".
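Because role is a comma-separated string, an exact match on each element is a safer test than a substring search. The helper has_role below is a hypothetical base R sketch, not a MockData function; it also tolerates stray spaces after commas.

```r
# Hypothetical helper: TRUE where the comma-separated role string contains
# `target` as a whole element (surrounding whitespace is ignored).
has_role <- function(role, target) {
  vapply(strsplit(role, ",", fixed = TRUE),
         function(parts) target %in% trimws(parts),
         logical(1), USE.NAMES = FALSE)
}

roles <- c("enabled,predictor,table1",   # generated
           "predictor,table1",           # not generated: "enabled" missing
           "enabled,outcome, table1")    # space after comma still matches

is_enabled <- has_role(roles, "enabled")
is_table1  <- has_role(roles, "table1")
```

Unlike a pattern search for "table1", this would not match a longer role name that merely contains that text.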
Common errors:
# ❌ WRONG: Forgot "enabled"
uid,variable,role
v001,age,"predictor,table1"
# Result: age will NOT be generated
# ✅ CORRECT
uid,variable,role
v001,age,"enabled,predictor,table1"
Multi-role filtering example:
# Get only variables for Table 1
table1_vars <- variables[grepl("table1", variables$role), ]
# Get predictors
predictors <- variables[grepl("predictor", variables$role), ]
# Get enabled variables for generation
enabled_vars <- variables[grepl("enabled", variables$role), ]
position and seed: Generation order and reproducibility
Purpose: Control variable generation order and ensure reproducible independence
position:
- Generation order (ascending)
- Use increments of 10 (10, 20, 30…) for easy insertion
- Lower numbers generated first
seed:
- Random seed for this specific variable
- Recommended: seed = position × 10 (e.g., position 20 → seed 200)
- Prevents correlation artifacts between variables
Why position matters:
Some variables may depend on others being generated first (though MockData generally handles this automatically).
Why seed matters:
Using the same seed for all variables can create artificial correlations. Different seeds ensure statistical independence.
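The effect is easy to reproduce in base R, independent of MockData. This sketch only demonstrates how R's random number generator behaves when the seed is reset; the variable names and parameters are borrowed from the examples in this document.

```r
# Same seed: both variables reuse the same underlying random draws,
# so one is just a rescaled copy of the other.
set.seed(123)
age_same <- rnorm(1000, mean = 50, sd = 15)
set.seed(123)
bmi_same <- rnorm(1000, mean = 27.5, sd = 5.2)
cor_same <- cor(age_same, bmi_same)   # essentially 1

# Different seeds: the draws are unrelated.
set.seed(100)
age_diff <- rnorm(1000, mean = 50, sd = 15)
set.seed(300)
bmi_diff <- rnorm(1000, mean = 27.5, sd = 5.2)
cor_diff <- cor(age_diff, bmi_diff)   # close to 0
```

With a shared seed the two columns are perfectly correlated even though their means and standard deviations differ, which is the artifact that per-variable seeds avoid.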
Examples:
# Good: Increments of 10, seed = position × 10
uid,variable,position,seed
v001,age,10,100
v002,smoking,20,200
v003,BMI,30,300
Common errors:
# ❌ WRONG: Same seed for all variables
uid,variable,position,seed
v001,age,10,123
v002,smoking,20,123
v003,BMI,30,123
# Result: Variables may be artificially correlated
# ❌ WRONG: Sequential positions (hard to insert)
uid,variable,position,seed
v001,age,1,10
v002,smoking,2,20
v003,BMI,3,30
# Result: Hard to insert new variable between age and smoking
Inserting variables:
# Original
uid,variable,position
v001,age,10
v003,BMI,30
# Insert smoking between age and BMI (position 20 fits without renumbering)
uid,variable,position
v001,age,10
v002,smoking,20
v003,BMI,30
sourceFormat vs sourceData
Purpose: Simulate raw data formats for harmonization pipeline testing
Column name: Currently sourceFormat in minimal-example (documentation shows sourceData)
Valid values:
| Value | Output type | Simulates | Example |
|---|---|---|---|
| analysis | R Date object | Analysis-ready dates | as.Date("2001-01-01") |
| csv | Character string | CSV file dates | "2001-01-01" |
| sas | Numeric | SAS date numeric | 14976 (days since 1960-01-01) |
Use case: Test date parsing/harmonization logic
Examples:
# Analysis-ready (default)
uid,variable,sourceFormat
v004,interview_date,analysis
# CSV import simulation
uid,variable,sourceFormat
v004,interview_date,csv
# SAS import simulation
uid,variable,sourceFormat
v004,interview_date,sas
Conversion examples:
# CSV to Date
dates_csv <- "2001-01-01"
as.Date(dates_csv)
# SAS to Date
dates_sas <- 14976
as.Date(dates_sas, origin = "1960-01-01")
See Working with date variables for detailed examples.
distribution: Distribution types by variable type
Purpose: Specify how values are distributed across valid ranges
For categorical variables: Use NA (categories defined by proportions in variable_details.csv)
For continuous variables:
| Distribution | Parameters required | Use case | Example |
|---|---|---|---|
| normal | mean, sd | Age, BMI, normally-distributed measurements | Age: mean=50, sd=15 |
| uniform | None (uses recStart range) | Equal probability across range | Income brackets, uniform codes |
For date variables:
| Distribution | Parameters required | Use case | Example |
|---|---|---|---|
| uniform | None | Index dates, enrollment dates | Interview date |
| gompertz | rate, shape, followup_min, followup_max, event_prop | Age-related events (death, dementia) | Mortality with increasing hazard by age |
| exponential | rate, followup_min, followup_max, event_prop | Constant hazard events | Loss to follow-up |
Complete examples:
# Continuous: normal distribution
uid,variable,variableType,distribution,mean,sd
v001,age,Continuous,normal,50,15
# Continuous: uniform distribution
uid,variable,variableType,distribution
v010,income_bracket,Continuous,uniform
# Categorical: use NA (proportions in variable_details.csv)
uid,variable,variableType,distribution
v002,smoking,Categorical,NA
# Date: uniform (index date)
uid,variable,variableType,rType,distribution
v004,interview_date,Continuous,date,uniform
# Date: gompertz survival (event date)
uid,variable,variableType,rType,distribution,rate,shape,followup_min,followup_max,event_prop
v005,death_date,Continuous,date,gompertz,0.0001,0.1,365,7300,0.2
Common errors:
# ❌ WRONG: Normal distribution without mean/sd
uid,variable,distribution,mean,sd
v001,age,normal,,
# ❌ WRONG: Distribution for categorical variable
uid,variable,variableType,distribution
v002,smoking,Categorical,normal
# ✅ CORRECT
uid,variable,distribution,mean,sd
v001,age,normal,50,15
v002,smoking,NA,NA,NA
Parameter interpretation:
- Normal: Values drawn from N(mean, sd²), truncated to recStart range
- Gompertz: Hazard increases exponentially with age (realistic mortality)
- Exponential: Constant hazard over time (random censoring)
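The difference between the two hazards can be illustrated with inverse-transform sampling in base R. This is a sketch under stated assumptions: it uses the textbook Gompertz hazard h(t) = rate × exp(shape × t), which may not be the parameterization MockData uses internally, the time unit is left unspecified, and r_gompertz is a hypothetical helper. Follow-up limits and event_prop are not applied here.

```r
# Hypothetical sampler for the textbook Gompertz hazard h(t) = rate * exp(shape * t).
# Survival: S(t) = exp(-(rate / shape) * (exp(shape * t) - 1)); solve S(t) = u for t.
r_gompertz <- function(n, rate, shape) {
  stopifnot(rate > 0, shape != 0)
  u <- runif(n)
  log(1 - (shape / rate) * log(u)) / shape
}

set.seed(80)
rate  <- 1e-04
shape <- 0.1

t_gompertz    <- r_gompertz(10000, rate = rate, shape = shape)
t_exponential <- rexp(10000, rate = rate)   # constant hazard h(t) = rate

# Theoretical Gompertz median, for comparison with the sample median
median_theory <- log(1 + (shape / rate) * log(2)) / shape
median_sample <- median(t_gompertz)
```

Under this parameterization the Gompertz times cluster tightly around the median because the hazard accelerates, while the exponential times are spread widely with a long right tail.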
Distribution comparison table
| Distribution | Shape | Median location | Skewness | Best for |
|---|---|---|---|---|
| normal | Bell curve | At mean | Symmetric | Age, BMI, height, normally-distributed measurements |
| uniform | Flat | Middle of range | Symmetric | Dates without temporal pattern, random assignment |
| gompertz | Right-skewed | Shifted toward end | Positive | Mortality, age-related disease onset |
| exponential | Right-skewed | Shifted toward start | Positive | Time to first event, loss to follow-up |
Visual interpretation (for 2001-2020 date range):
- uniform: Median ≈ 2010 (middle)
- gompertz: Median ≈ 2015-2018 (later years, increasing hazard)
- exponential: Median ≈ 2003-2007 (earlier years, constant hazard)
Frequently asked questions
General configuration
Q: What’s the difference between variables.csv and variable_details.csv?
A: variables.csv defines variable-level metadata (one row per variable). variable_details.csv defines detail-level specifications (multiple rows per variable for categories/ranges). Think of it as a one-to-many relationship.
Q: Can I use a single CSV file instead of two?
A: No. The two-file structure comes from recodeflow and is required. It enables:
- Multiple category definitions per variable
- Clean separation of variable-level vs category-level parameters
- Standardized harmonization workflow integration
Q: Do I need to fill in all columns?
A: No. Only required columns need values. Optional columns should use:
- NA for not applicable
- Empty string for unknown (though NA preferred)
- Sentinel values like [,] for empty ranges
Q: How do I know which columns are required?
A: See Quick reference table. Core required columns:
- variables.csv: uid, variable, variableType
- variable_details.csv: uid, uid_detail, variable, recStart
Variable types
Q: When should I use Categorical vs Continuous for numeric codes?
A:
- Categorical: Discrete codes with specific meanings (smoking status: 1=never, 2=former, 3=current)
- Continuous: Numeric measurements on a scale (age, BMI, income)
Rule of thumb: If you would calculate a mean, use Continuous. If you would calculate proportions, use Categorical.
Q: Can I have a Continuous variable with only integer values?
A: Yes. Set variableType = "Continuous" and rType = "integer". Example: Age in years.
Q: What’s the difference between variableType and rType?
A:
- variableType: Conceptual type from recodeflow (Categorical or Continuous) - determines generation logic
- rType: R output data type specified by MockData (integer, double, character, date, factor) - determines output format
Note: Date variables use variableType = "Continuous" and rType = "date". The variableType field comes from recodeflow and uses only Categorical/Continuous values.
Q: Should smoking status be factor or character?
A: Use rType = "factor" for categorical variables. Factors preserve category levels and enable proper statistical analysis.
Garbage data
Q: What’s the difference between garbage_low/garbage_high and prop_garbage?
A: Two modes:
- Advanced (garbage_low/garbage_high): You specify exact invalid ranges
  - Example: Age with garbage_low_range = "[-5,10]" generates negative ages
- Simple (prop_garbage): MockData auto-generates invalid values
  - Example: Categorical with prop_garbage = 0.05 gets invalid codes like 99, 999
Precedence: If you specify garbage_low_prop OR garbage_high_prop, those take precedence. Otherwise prop_garbage is used.
Q: Why would I want garbage data?
A: To test data quality pipelines:
- Validate your data cleaning code catches impossible values
- Train analysts on real-world data quality issues
- Test edge cases in harmonization logic
Q: Can I have both low and high garbage data?
A: Yes. Specify both garbage_low_prop/garbage_low_range AND garbage_high_prop/garbage_high_range:
uid,variable,garbage_low_prop,garbage_low_range,garbage_high_prop,garbage_high_range
v003,BMI,0.02,"[-10,0]",0.01,"[60,150]"
Result: 2% have BMI between -10 and 0, 1% have BMI between 60 and 150, and the rest have valid values.
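A minimal base R sketch of two-sided contamination, assuming garbage values are drawn uniformly within each range and replace a random subset of rows. How MockData actually draws and places garbage values is not specified here, and the valid BMI values are uniform placeholders rather than the configured normal distribution.

```r
set.seed(300)
n <- 1000

# Placeholder valid values inside the valid BMI range [15, 50]
bmi <- runif(n, min = 15, max = 50)

# Number of rows to contaminate on each side (2% low, 1% high)
n_low  <- round(n * 0.02)
n_high <- round(n * 0.01)

# Pick distinct rows so low and high garbage never overlap
idx      <- sample(n, n_low + n_high)
low_idx  <- idx[seq_len(n_low)]
high_idx <- idx[n_low + seq_len(n_high)]

# Assumption: garbage values are uniform within the configured ranges
bmi[low_idx]  <- runif(n_low, min = -10, max = 0)
bmi[high_idx] <- runif(n_high, min = 60, max = 150)

prop_low  <- mean(bmi < 15)   # share of low garbage
prop_high <- mean(bmi > 50)   # share of high garbage
```

A cleaning pipeline under test should flag exactly the rows in low_idx and high_idx.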
Missing data
Q: What’s the difference between NA::a and NA::b?
A: Survey missing data classification:
- NA::a (valid skip): Question not asked due to logic (respondent not eligible)
  - Example: Pregnancy questions skipped for males
  - Statistical treatment: Exclude from denominator
- NA::b (missing response): Question asked but no valid answer (don’t know, refusal, not stated)
  - Example: Respondent refused to answer income question
  - Statistical treatment: Include in denominator, exclude from numerator
Q: Do missing codes count toward the proportion sum?
A: Yes. All proportions must sum to 1.0, including missing codes:
uid,recStart,recEnd,proportion
v002,1,1,0.50
v002,2,2,0.30
v002,3,3,0.17
v002,7,NA::b,0.03
# Sum = 1.00 ✓
Q: How do I generate missing data for continuous/date variables?
A: Use else in recStart with NA classification:
uid,variable,recStart,recEnd,proportion
v001,age,"[18,100]","copy",
v001,age,"else","NA::b",0.05
Result: 5% of age values will be NA (recoded from “else” = everything not in range).
Q: Can I use ranges for missing codes?
A: Yes:
# Map codes 997-999 to NA::b
uid,recStart,recEnd
v001,"[997,999]",NA::b
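When you want to check a range yourself, interval strings such as "[997,999]" are easy to parse outside MockData. The helper below is a minimal sketch: parse_interval() is a hypothetical name, not a MockData function, and it assumes closed numeric intervals only (no dates, no open bounds).

```r
# Hypothetical helper (not part of MockData): parse a closed numeric
# interval written as "[low,high]" into a length-2 numeric vector.
parse_interval <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  # Strip the surrounding brackets and any outer whitespace
  inner <- gsub("^\\s*\\[|\\]\\s*$", "", x)
  parts <- trimws(strsplit(inner, ",", fixed = TRUE)[[1]])
  bounds <- suppressWarnings(as.numeric(parts))
  if (length(bounds) != 2 || anyNA(bounds) || bounds[1] > bounds[2]) {
    stop("Not a valid numeric interval: ", x)
  }
  bounds
}

missing_range <- parse_interval("[997,999]")

# Flag raw codes that fall inside the missing-code range
raw_codes <- c(25, 997, 64, 999, 998)
is_missing <- raw_codes >= missing_range[1] & raw_codes <= missing_range[2]
```

Here is_missing marks 997, 999, and 998 as missing and leaves 25 and 64 as valid values.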
Distributions and parameters
Q: What happens if normal distribution generates values outside recStart range?
A: Values are truncated to the recStart range. Example:
uid,variable,distribution,mean,sd,recStart
v001,age,normal,50,15,"[18,100]"
Result: Normal distribution N(50, 15²) truncated to [18, 100]. No values < 18 or > 100.
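The effect of truncation can be reproduced in base R. This is a sketch under an assumption: the vignette does not say how MockData truncates, so the code uses rejection sampling (out-of-range draws are discarded and redrawn). Clamping values to the bounds would be a different reading and would pile values up at 18 and 100. rtruncnorm_sketch() is an illustrative name.

```r
# Sketch only: rejection sampling from N(mean, sd^2) restricted to
# [lower, upper]. MockData's internal method may differ.
rtruncnorm_sketch <- function(n, mean, sd, lower, upper) {
  out <- numeric(0)
  while (length(out) < n) {
    draw <- rnorm(n, mean = mean, sd = sd)
    # Keep only the draws that fall inside the allowed range
    out <- c(out, draw[draw >= lower & draw <= upper])
  }
  out[seq_len(n)]
}

set.seed(100)
age <- rtruncnorm_sketch(1000, mean = 50, sd = 15, lower = 18, upper = 100)
range(age)  # both ends fall inside [18, 100]
```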
Q: When should I use gompertz vs exponential for survival data?
A:
- Gompertz: Age-related events where hazard increases with time (mortality, dementia, chronic disease)
- Exponential: Constant hazard events where risk doesn’t change with time (loss to follow-up, random censoring)
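The difference between the two is easiest to see in the hazard functions. The sketch below assumes the common parameterization h(t) = rate * exp(shape * t) for Gompertz. The vignette does not state which parameterization or time unit MockData uses, so treat the function names and the time scale as illustrative.

```r
# Assumed parameterization (not confirmed by this vignette):
#   Gompertz hazard:    h(t) = rate * exp(shape * t)
#   Exponential hazard: h(t) = rate (constant)
gompertz_hazard <- function(t, rate, shape) rate * exp(shape * t)
exponential_hazard <- function(t, rate) rep(rate, length(t))

t <- c(0, 10, 20)  # time in arbitrary units
h_gompertz <- gompertz_hazard(t, rate = 0.0001, shape = 0.1)  # rises with t
h_exponential <- exponential_hazard(t, rate = 0.0001)         # stays flat
```

With a positive shape the Gompertz hazard grows by a factor of exp(shape) per time unit, while the exponential hazard equals rate at every time point.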
Q: What are typical gompertz parameters?
A: For mortality in elderly cohorts:
distribution,rate,shape,followup_min,followup_max,event_prop
gompertz,0.0001,0.1,365,7300,0.2
- rate = 0.0001: Baseline hazard
- shape = 0.1: Hazard acceleration (positive = increasing hazard with age)
- followup_max = 7300: 20 years
- event_prop = 0.2: 20% experience event
Q: Why do my date variables all have the same value?
A: Check:
- Are you using distribution = "uniform"? (Required for variation)
- Is recStart an interval [start,end], not a single date?
- Did you set different seeds for each variable?
UIDs and foreign keys
Q: Must uid in variable_details.csv match uid in variables.csv exactly?
A: Yes. This is a foreign key relationship. Every uid in variable_details.csv must exist in variables.csv.
Q: Can I have multiple variables with the same uid?
A: No. UIDs must be unique within variables.csv. Use different UIDs for different variables.
Q: Can variable_details.csv have gaps in uid_detail numbering?
A: Yes. uid_detail values don’t need to be sequential, just unique:
uid_detail,variable
d001,age
d_003,age
d_042,age
# Valid - gaps are OK
Q: What happens if I reference a uid that doesn’t exist in variables.csv?
A: Validation error:
Error: uid 'v_999' in variable_details.csv not found in variables.csv
Proportions and validation
Q: How strict is the “proportions must sum to 1.0” rule?
A: Tolerance ±0.01. These are valid:
- 1.00 ✓
- 0.99 ✓
- 1.01 ✓
- 0.98 ❌ (too far from 1.0)
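If you pre-check totals yourself, note that a naive comparison can reject 0.99 because of floating-point representation. The sketch below assumes the ±0.01 tolerance described above. proportions_ok() is a hypothetical helper, not a MockData function, and the extra 1e-9 slack is an assumption added to absorb rounding error.

```r
# Hypothetical check (not part of MockData). In binary arithmetic
# abs(0.99 - 1) is slightly larger than 0.01, so a small slack is added.
proportions_ok <- function(total, tol = 0.01) {
  abs(total - 1) <= tol + 1e-9
}

totals <- c(1.00, 0.99, 1.01, 0.98)
proportions_ok(totals)  # TRUE TRUE TRUE FALSE
```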
Q: Do garbage data proportions count in the sum?
A: No. Garbage data (garbage_low_prop, garbage_high_prop, prop_garbage) is separate from population proportions.
Example:
# Proportions in variable_details.csv must sum to 1.0
uid,recStart,proportion
v002,1,0.50
v002,2,0.30
v002,3,0.20
# Sum = 1.00 ✓
# Garbage data in variables.csv is additional
uid,prop_garbage
v002,0.05
# Result: 95% valid (distributed 50/30/20), 5% invalid codes
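The arithmetic can be sketched in base R. This illustrates only how the two sets of proportions relate. It assumes garbage codes overwrite a random 5% of rows after the valid categories are drawn, which may not match MockData's internal sampling.

```r
# Sketch of the arithmetic only; MockData's own sampling may differ.
set.seed(200)
n <- 1000
valid_codes <- c(1, 2, 3)
valid_props <- c(0.50, 0.30, 0.20)  # sums to 1.0 on its own
prop_garbage <- 0.05                # applied on top, not part of the sum

# Draw valid categories first
smoking <- sample(valid_codes, n, replace = TRUE, prob = valid_props)

# Overwrite a random 5% of rows with invalid codes
garbage_rows <- sample(n, size = round(n * prop_garbage))
smoking[garbage_rows] <- sample(c(99, 999), length(garbage_rows), replace = TRUE)

garbage_share <- mean(smoking %in% c(99, 999))  # 0.05
```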
Q: What if I want unequal category probabilities?
A: Specify exact proportions in variable_details.csv:
uid,recStart,catLabel,proportion
v002,1,Never smoker,0.60
v002,2,Former smoker,0.25
v002,3,Current smoker,0.15
# Reflects real population distribution
Database filtering and multi-cycle data
Q: What is databaseStart and when do I need it?
A: Database/cycle identifier for filtering. Needed when:
- Generating data for multiple survey cycles (CCHS 2001, 2005, 2009)
- Variables have database-specific category codes
- Testing multi-cycle harmonization
Q: How do I specify which databases a row applies to?
A: Use comma-separated list in databaseStart column (variable_details.csv):
uid,variable,databaseStart,recStart
v001,age,"cchs2001_p,cchs2005_p","[18,100]"
v002,smoking,cchs2001_p,1
v002,smoking,cchs2005_p,01
Row 1 applies to both databases. Rows 2-3 are database-specific (different codes for smoking).
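To preview which rows apply to a database, you can split the comma-separated list and match entries exactly. This is a sketch: applies_to() is a hypothetical helper, and the vignette does not show how MockData itself performs the match. Exact matching avoids the substring hits that a pattern search can produce when one database name is a prefix of another.

```r
# Hypothetical helper (not part of MockData): TRUE for rows whose
# comma-separated databaseStart list contains `database` as an exact entry.
applies_to <- function(database_start, database) {
  entries <- strsplit(database_start, ",", fixed = TRUE)
  vapply(entries, function(x) database %in% trimws(x), logical(1))
}

variable_details <- data.frame(
  uid = c("v001", "v002", "v002"),
  variable = c("age", "smoking", "smoking"),
  databaseStart = c("cchs2001_p,cchs2005_p", "cchs2001_p", "cchs2005_p"),
  recStart = c("[18,100]", "1", "01"),
  stringsAsFactors = FALSE
)

keep <- applies_to(variable_details$databaseStart, "cchs2005_p")
rows_2005 <- variable_details[keep, ]
rows_2005$recStart  # "[18,100]" "01"
```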
Q: What if I’m only generating data for one database?
A: Use a single database name consistently:
uid,variable,databaseStart,recStart
v001,age,my_study,"[18,100]"
v002,smoking,my_study,1
Then generate with:
create_mock_data(
databaseStart = "my_study",
variables = variables,
variable_details = variable_details
)
Troubleshooting
Validation errors
Error: “Proportions for variable ‘smoking’ sum to 0.97, expected 1.0”
Cause: Category proportions don’t sum to 1.0 (±0.01 tolerance)
Fix: Check proportions in variable_details.csv for that variable:
# Identify the problem
library(dplyr)
variable_details %>%
  filter(variable == "smoking") %>%
  summarize(total = sum(proportion, na.rm = TRUE))
# Fix: Adjust proportions to sum to 1.0
Error: “uid ‘v_042’ in variable_details.csv not found in variables.csv”
Cause: Foreign key violation - referenced uid doesn’t exist
Fix: Check for typos or missing rows:
# Find orphaned uids
details_uids <- unique(variable_details$uid)
vars_uids <- unique(variables$uid)
orphans <- setdiff(details_uids, vars_uids)
print(orphans)
# Fix: Either add missing uid to variables.csv or fix typo in variable_details.csv
Error: “Required column ‘recStart’ not found in variable_details.csv”
Cause: Missing required column
Fix: Add the missing column:
# Check what columns exist
names(variable_details)
# Add missing column (with default values if needed)
variable_details$recStart <- NA
Error: “Invalid rType value ‘float’ for variable ‘BMI’”
Cause: rType must be one of: integer, double, factor, date, character, logical
Fix: Use "double" not "float":
# ❌ WRONG
uid,variable,rType
v003,BMI,float
# ✅ CORRECT
uid,variable,rType
v003,BMI,double
Error: “distribution ‘normal’ requires ‘mean’ and ‘sd’ parameters”
Cause: Normal distribution missing required parameters
Fix: Specify mean and sd:
# ❌ WRONG
uid,variable,distribution,mean,sd
v001,age,normal,,
# ✅ CORRECT
uid,variable,distribution,mean,sd
v001,age,normal,50,15
Generation issues
Problem: All date variables have the same value
Possible causes:
- Used single date instead of interval in recStart
- Forgot to specify distribution
- Same seed for all variables
Fix:
# Check recStart uses interval notation
uid,variable,recStart
v004,interview_date,"[2001-01-01,2005-12-31]"
# Check distribution is specified
uid,variable,distribution
v004,interview_date,uniform
# Check different seeds
uid,variable,seed
v004,interview_date,400
v005,event_date,500
Problem: No variables generated (empty data frame)
Possible causes:
- No variables have role = "enabled"
- databaseStart filter excludes all rows
- All variables are derived variables
Fix:
# Check which variables are enabled
enabled <- variables[grepl("enabled", variables$role), ]
nrow(enabled) # Should be > 0
# Check databaseStart filtering (use the same name you pass to create_mock_data())
database_name <- "my_study"
filtered <- variable_details[
  grepl(database_name, variable_details$databaseStart, fixed = TRUE),
]
nrow(filtered) # Should be > 0
Problem: Proportions don’t match expected distribution
Cause: Forgot to account for missing data proportion
Fix: Missing data proportion reduces valid category proportions:
# If you want 50% never smokers in VALID responses:
uid,recStart,recEnd,proportion
v002,1,1,0.475 # 50% of 95% = 47.5%
v002,2,2,0.285 # 30% of 95% = 28.5%
v002,3,3,0.19 # 20% of 95% = 19%
v002,7,NA::b,0.05 # 5% missing
# Sum = 1.00
# Result: Among VALID responses, 50/30/20 split
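The rescaling above is a single multiplication by (1 - missing proportion). A minimal sketch follows; rescale_for_missing() is a hypothetical helper name, not a MockData function.

```r
# Hypothetical helper: scale target shares among valid responses so that,
# together with the missing-data share, everything sums to 1.0.
rescale_for_missing <- function(valid_share, missing_prop) {
  stopifnot(isTRUE(all.equal(sum(valid_share), 1)))
  valid_share * (1 - missing_prop)
}

proportions <- rescale_for_missing(c(0.50, 0.30, 0.20), missing_prop = 0.05)
proportions              # 0.475 0.285 0.190
sum(proportions) + 0.05  # 1
```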
Problem: Garbage data not appearing
Possible causes:
- Proportion too small to see in sample
- Wrong column name (garbage_low vs corrupt_low)
- Precedence issue (prop_garbage ignored if garbage_low_prop specified)
Fix:
# Increase proportion for testing
uid,variable,garbage_high_prop
v003,BMI,0.20 # 20% easier to verify than 1%
# Check correct column names (garbage_ not corrupt_)
uid,variable,garbage_low_prop,garbage_low_range
v001,age,0.05,"[-5,10]"
# If using advanced mode, don't specify prop_garbage
uid,variable,garbage_low_prop,garbage_low_range,prop_garbage
v001,age,0.05,"[-5,10]", # Leave prop_garbage empty
Performance issues
Problem: Generation very slow for large n
Cause: Complex distributions (especially normal) are slower than uniform
Solutions:
- Use uniform distribution where appropriate
- Generate in batches
- Simplify metadata (fewer variables, fewer categories)
# Batch generation example
batch_size <- 100000
n_batches <- 10
results <- lapply(1:n_batches, function(i) {
create_mock_data(
databaseStart = "my_study",
variables = variables,
variable_details = variable_details,
n = batch_size,
seed = 1000 + i # Different seed per batch
)
})
final_data <- dplyr::bind_rows(results)
Complete examples
Example 1: Basic categorical variable with missing codes
Use case: Smoking status with standard survey missing codes
# variables.csv
uid,variable,label,variableType,rType,role,position,seed,prop_garbage
v002,smoking,Smoking status,Categorical,factor,"enabled,predictor",20,200,0.05
# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v002,d_002,smoking,1,1,Never smoker,0.50
v002,d_003,smoking,2,2,Former smoker,0.30
v002,d_004,smoking,3,3,Current smoker,0.17
v002,d_005,smoking,7,NA::b,Don't know,0.03
Result: 50% never/30% former/17% current smokers, 3% don’t know (NA), plus 5% invalid codes (99, 999) from prop_garbage.
Example 2: Continuous variable with normal distribution and precise garbage ranges
Use case: Body mass index with biologically impossible values for QA testing
# variables.csv
uid,variable,label,variableType,rType,role,position,seed,garbage_low_prop,garbage_low_range,garbage_high_prop,garbage_high_range,distribution,mean,sd
v003,BMI,Body mass index,Continuous,double,"enabled,outcome",30,300,0.02,"[-10,0]",0.01,"[60,150]",normal,27.5,5.2
# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v003,d_006,BMI,"[15,50]",copy,Valid BMI range,
Result: Normal distribution N(27.5, 5.2²) truncated to [15, 50], with 2% negative BMI values and 1% extremely high (60-150) values.
Example 3: Continuous variable with missing data
Use case: Age with survey missing codes and uniform distribution
# variables.csv
uid,variable,label,variableType,rType,role,position,seed
v001,age,Age in years,Continuous,integer,"enabled,predictor,table1",10,100
# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v001,d001,age,"[18,100]",copy,Valid age range,0.90
v001,d_002,age,997,NA::b,Don't know,0.05
v001,d_003,age,998,NA::b,Refusal,0.03
v001,d_004,age,999,NA::b,Not stated,0.02
Result: 90% valid ages uniformly distributed 18-100, 10% missing with codes 997/998/999 (mapped to NA).
Example 4: Date variable (cohort entry/index date)
Use case: Interview date as time origin for survival analysis
# variables.csv
uid,variable,label,variableType,rType,role,position,seed,distribution,sourceFormat
v004,interview_date,Interview date,Continuous,date,"enabled,metadata",40,400,uniform,analysis
# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v004,d_007,interview_date,"[2001-01-01,2005-12-31]",copy,Interview date range,1
v004,d_008,interview_date,else,NA::b,Missing interview date,0
Result: Uniform distribution of dates 2001-2005 (all respondents interviewed), output as R Date objects.
Example 5: Survival variable with competing risks (gompertz distribution)
Use case: Primary event (dementia diagnosis) with age-related hazard and temporal garbage data
# variables.csv
uid,variable,label,variableType,rType,role,position,seed,garbage_high_prop,garbage_high_range,distribution,rate,shape,followup_min,followup_max,event_prop
v005,primary_event_date,Primary event date (dementia),Continuous,date,"enabled,outcome,table1",50,500,0.03,"[2021-01-01,2099-12-31]",gompertz,0.0001,0.1,0,5475,0.1
# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v005,d_009,primary_event_date,"[2002-01-01,2021-01-01]",copy,Primary event date range,0.1
v005,d_010,primary_event_date,else,NA::b,Missing event date,0
Interpretation:
- 10% experience dementia diagnosis within 0-15 years after interview
- Event times follow gompertz distribution (increasing hazard with age)
- 3% have impossible future dates (2021-2099) for QA testing
- 90% censored (no event)
Example 6: Multi-cycle categorical variable with database-specific codes
Use case: Smoking status with different category codes across survey cycles
# variables.csv
uid,variable,label,variableType,rType,role,position,seed
v002,smoking,Smoking status,Categorical,factor,"enabled,predictor",20,200
# variable_details.csv
uid,uid_detail,variable,databaseStart,recStart,recEnd,catLabel,proportion
v002,d_002,smoking,cchs2001_p,1,1,Never smoker,0.50
v002,d_003,smoking,cchs2001_p,2,2,Former smoker,0.30
v002,d_004,smoking,cchs2001_p,3,3,Current smoker,0.20
v002,d_005,smoking,cchs2005_p,01,1,Never smoker,0.50
v002,d_006,smoking,cchs2005_p,02,2,Former smoker,0.30
v002,d_007,smoking,cchs2005_p,03,3,Current smoker,0.20
Result:
- CCHS 2001: Codes 1/2/3 (numeric)
- CCHS 2005: Codes 01/02/03 (zero-padded strings)
- Same proportions, different source codes
- Both map to harmonized values 1/2/3 (recEnd column)
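The recStart to recEnd mapping for one cycle behaves like a lookup table. The sketch below illustrates that idea with a named character vector. It is not the recodeflow API, and the data frame is typed in by hand from the rows above.

```r
# Sketch of the recStart -> recEnd mapping for the cchs2005_p rows
# (illustrative only; not the recodeflow API).
details_2005 <- data.frame(
  recStart = c("01", "02", "03"),
  recEnd   = c("1", "2", "3"),
  stringsAsFactors = FALSE
)
lookup <- setNames(details_2005$recEnd, details_2005$recStart)

raw_2005 <- c("02", "01", "03", "01")  # source codes as zero-padded strings
harmonized <- unname(lookup[raw_2005])
harmonized  # "2" "1" "3" "1"
```

Zero-padded codes survive only if the column is read as text. With read.csv(), pass colClasses = "character" for that column, otherwise "01" is read as the integer 1.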
Example 7: Derived variable (BMI from height and weight)
Use case: BMI calculated post-generation from height and weight
# variables.csv (height)
uid,variable,label,variableType,rType,role,position,seed,distribution,mean,sd
v004,height,Height in meters,Continuous,double,"enabled,predictor",40,400,normal,1.7,0.1
# variables.csv (weight)
uid,variable,label,variableType,rType,role,position,seed,distribution,mean,sd
v005,weight,Weight in kg,Continuous,double,"enabled,predictor",50,500,normal,75,15
# variables.csv (BMI_derived - NOT enabled)
uid,variable,label,variableType,rType,role,position,seed
v006,BMI_derived,BMI calculated from height and weight,Continuous,double,outcome,60,600
# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel
v004,d_012,height,"[1.4,2.1]",copy,Valid height range (meters)
v005,d_014,weight,"[35,150]",copy,Valid weight range (kg)
v006,d_016,BMI_derived,"DerivedVar::[height, weight]","Func::bmi_fun",BMI calculated from height and weight
Generation workflow:
# 1. Generate height and weight
mock_data <- create_mock_data(
databaseStart = "my_study",
variables = variables,
variable_details = variable_details,
n = 1000
)
# Result: Contains height, weight (but NOT BMI_derived)
# 2. Calculate BMI_derived post-generation
bmi_fun <- function(height, weight) {
ifelse(
is.na(height) | is.na(weight) | height <= 0,
NA_real_,
weight / (height^2)
)
}
mock_data$BMI_derived <- bmi_fun(mock_data$height, mock_data$weight)
Note: Derived variables are NOT generated by create_mock_data(). See Advanced topics: Derived variables for details.
Example 8: Complete survival analysis dataset with competing risks
Use case: Cohort study with primary event, death, and censoring
# variables.csv
uid,variable,label,variableType,rType,role,position,seed,distribution,rate,shape,followup_min,followup_max,event_prop
v004,interview_date,Interview date,Continuous,date,"enabled,metadata",40,400,uniform,,,,,
v005,primary_event_date,Dementia diagnosis,Continuous,date,"enabled,outcome",50,500,gompertz,0.0001,0.1,0,5475,0.1
v006,death_date,Death date,Continuous,date,"enabled,outcome",60,600,gompertz,0.0001,0.1,365,7300,0.2
v_007,ltfu_date,Loss to follow-up,Continuous,date,"enabled,outcome",70,700,uniform,,,365,7300,0.1
v_008,admin_censor_date,Administrative censoring,Continuous,date,"enabled,metadata",80,800,,,,365,7300,1
# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,proportion
v004,d_007,interview_date,"[2001-01-01,2005-12-31]",copy,1
v005,d_009,primary_event_date,"[2002-01-01,2021-01-01]",copy,0.1
v006,d_011,death_date,"[2002-01-01,2024-12-31]",copy,0.2
v_007,d_013,ltfu_date,"[2002-01-01,2024-12-31]",copy,0.1
v_008,d_015,admin_censor_date,2024-12-31,copy,1
Interpretation:
- Index date: Interview 2001-2005 (100% have date)
- Primary event: 10% develop dementia within 0-15 years (gompertz hazard)
- Competing risk: 20% die within 1-20 years (gompertz hazard)
- Censoring: 10% lost to follow-up within 1-20 years (uniform)
- Administrative: All censored at 2024-12-31
Result: Realistic competing risks dataset with:
- Some experience primary event before death/censoring
- Some die before primary event
- Some censored (lost to follow-up or administrative)
- No individual can have more than one terminal event
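One way to derive a single terminal event per person from the date columns is to keep the earliest non-missing date. The sketch below uses a hand-made four-row data frame, not MockData output. The column names follow the example above, and the end_date and end_type names are illustrative assumptions.

```r
# Sketch: one row per person; NA means the event did not occur.
cohort <- data.frame(
  primary_event_date = as.Date(c("2010-05-01", NA, NA, "2015-03-10")),
  death_date         = as.Date(c(NA, "2012-08-15", NA, "2014-01-20")),
  ltfu_date          = as.Date(c(NA, NA, "2009-11-30", NA)),
  admin_censor_date  = as.Date(rep("2024-12-31", 4))
)
date_cols <- names(cohort)

# Dates as days since 1970-01-01, one matrix column per date variable
date_num <- sapply(cohort[date_cols], as.numeric)

# Column index of the earliest date per row (which.min skips NA)
first_col <- apply(date_num, 1, which.min)

cohort$end_type <- date_cols[first_col]
cohort$end_date <- as.Date(
  date_num[cbind(seq_len(nrow(date_num)), first_col)],
  origin = "1970-01-01"
)
```

In the fourth row the death date precedes the recorded primary event date, so follow-up ends at death and the later date is ignored.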
See Tutorial: Generating survival data with competing risks for complete workflow.
Example 9: Categorical variable with skip logic (NA::a)
Use case: Pregnancy question with valid skip for males/postmenopausal
# variables.csv
uid,variable,label,variableType,rType,role,position,seed
v010,currently_pregnant,Currently pregnant,Categorical,factor,"enabled,outcome",100,1000
# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v010,d_020,currently_pregnant,1,1,Yes,0.05
v010,d_021,currently_pregnant,2,2,No,0.30
v010,d_022,currently_pregnant,6,NA::a,Valid skip (not applicable),0.60
v010,d_023,currently_pregnant,9,NA::b,Not stated,0.05
Result:
- 5% currently pregnant
- 30% not pregnant
- 60% valid skip (males, postmenopausal females - not eligible for question)
- 5% eligible but didn’t answer
Statistical interpretation:
- Denominator for prevalence: 0.05 + 0.30 + 0.05 = 0.40 (eligible respondents)
- Prevalence among eligible: 0.05 / 0.40 = 12.5%
- Response rate: (0.05 + 0.30) / 0.40 = 87.5%
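The same figures can be reproduced from the proportions in variable_details.csv. A short sketch with the values typed in by hand:

```r
# Proportions from the variable_details.csv rows above
p <- c(yes = 0.05, no = 0.30, valid_skip = 0.60, not_stated = 0.05)

# NA::a (valid skip) leaves the denominator; NA::b (not stated) stays in it
eligible <- sum(p[c("yes", "no", "not_stated")])
prevalence <- p[["yes"]] / eligible
response_rate <- sum(p[c("yes", "no")]) / eligible

round(100 * c(prevalence = prevalence, response_rate = response_rate), 1)
```

This gives a prevalence of 12.5% and a response rate of 87.5% among eligible respondents.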
Example 10: Versioned metadata with garbage data evolution
Use case: Tracking changes to QA testing parameters over time
# variables.csv (version 1.0.0)
uid,variable,garbage_low_prop,garbage_low_range,mockDataVersion,mockDataVersionNotes
v001,age,0.01,"[-5,10]",1.0.0,Initial version with minimal contamination
# variables.csv (version 1.1.0 - increased QA testing)
uid,variable,garbage_low_prop,garbage_low_range,mockDataVersion,mockDataLastUpdated,mockDataVersionNotes
v001,age,0.05,"[-10,15]",1.1.0,2025-11-15,Increased contamination for comprehensive QA testing
Benefits:
- Track metadata evolution
- Document why garbage data proportions changed
- Reproduce exact mock datasets from specific versions
- Audit trail for scientific publications
Edge cases and special configurations
Binary categorical variable
Use case: Binary variable with only two categories
uid,uid_detail,variable,recStart,recEnd,proportion
v020,d_040,diabetes,0,0,0.90
v020,d_041,diabetes,1,1,0.10
Result: 90% no diabetes (0), 10% diabetes (1).
Administrative date (all same value)
Use case: Data freeze date - everyone has same value
# variables.csv
uid,variable,variableType,rType,distribution
v_008,admin_censor_date,Continuous,date,
# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,proportion
v_008,d_015,admin_censor_date,2024-12-31,copy,1
Result: All rows have 2024-12-31 (single date, not a range).
Continuous variable with missing data using “else”
Use case: Height with catch-all missing pattern
uid,uid_detail,variable,recStart,recEnd,proportion
v004,d_012,height,"[1.4,2.1]",copy,
v004,d_013,height,else,NA::b,0.02
Result: 98% valid heights [1.4, 2.1], 2% coded as else → NA.
Zero-variance continuous variable (for testing)
Use case: Constant value for all observations
# variables.csv
uid,variable,distribution,mean,sd
v030,constant_value,normal,100,0.001
# variable_details.csv
uid,uid_detail,variable,recStart
v030,d_060,constant_value,"[99.99,100.01]"
Result: All values ≈ 100 (effectively constant given narrow range and tiny SD).
Schema design principles
Core principles
1. Complete grid: Every cell has an explicit value
- Use NA for not applicable
- Use "" (empty string) only when value is truly unknown
- Use sentinel values [,] for empty ranges
- No implicit defaults - be explicit
2. Separation of concerns:
- variables.csv: Variable-level metadata (one row per variable)
- variable_details.csv: Category/range specifications (multiple rows per variable)
- Generation logic: In R package functions, not in metadata
3. Traceability:
- UIDs provide permanent identifiers
- Versioning columns track metadata evolution
- Foreign keys link files (uid column)
4. Composability:
- Variables can be generated independently
- Metadata can be shared across databases (databaseStart filtering)
- Derived variables calculated post-generation
5. Validation-first:
- Schema validates before generation
- Fail fast with clear error messages
- Sum-to-one enforcement prevents silent errors
Best practices checklist
Before creating metadata:
During metadata creation:
After metadata creation:
For production use:
Migration from older versions
From recodeflow metadata (no MockData extensions)
If you have existing recodeflow metadata without MockData extension columns:
1. Add required extension columns to variables.csv:
# Minimal additions
variables$rType <- "double" # Or appropriate type
variables$role <- "enabled"
variables$position <- seq(10, by = 10, length.out = nrow(variables))
variables$seed <- variables$position * 10
2. Add distribution columns (if needed):
# For continuous variables
continuous_vars <- variables$variableType == "Continuous"
variables$distribution[continuous_vars] <- "uniform"
# For date variables (variableType = "Continuous", rType = "date")
date_vars <- variables$rType == "date"
variables$distribution[date_vars] <- "uniform"
3. Add recEnd to variable_details.csv:
# For continuous/date variables with ranges
variable_details$recEnd[is.na(variable_details$recEnd)] <- "copy"
# For categorical variables (map codes to themselves)
# Requires manual review to identify missing codes
4. Validate and test:
# Load and validate
mock_data <- create_mock_data(
databaseStart = "your_database",
variables = variables,
variable_details = variable_details,
n = 100
)
# Verify structure
str(mock_data)
summary(mock_data)
From MockData v0.1.x to v0.2.x
Key changes in v0.2.0:
1. Garbage data column names changed:
# OLD (v0.1.x)
corrupt_low_prop,corrupt_low_range,corrupt_high_prop,corrupt_high_range
# NEW (v0.2.x)
garbage_low_prop,garbage_low_range,garbage_high_prop,garbage_high_range
Migration:
# Rename columns
names(variables)[names(variables) == "corrupt_low_prop"] <- "garbage_low_prop"
names(variables)[names(variables) == "corrupt_low_range"] <- "garbage_low_range"
names(variables)[names(variables) == "corrupt_high_prop"] <- "garbage_high_prop"
names(variables)[names(variables) == "corrupt_high_range"] <- "garbage_high_range"
2. Distribution parameters moved to variables.csv:
In v0.1.x, some parameters were in variable_details.csv. In v0.2.x, all distribution parameters are in variables.csv.
3. New versioning columns available:
mockDataVersion,mockDataLastUpdated,mockDataVersionNotes
Add these to track metadata changes over time.
Performance optimization
For large datasets (n > 1M)
1. Use simpler distributions:
# Slower
distribution,mean,sd
normal,50,15
# Faster
distribution
uniform
2. Minimize variables:
Only include variables you actually need. Each variable adds generation time.
3. Batch generation:
# Generate in chunks
total_n <- 5000000   # Example target size
base_seed <- 1000    # Example base seed
batch_size <- 100000
n_batches <- ceiling(total_n / batch_size)
results <- lapply(seq_len(n_batches), function(i) {
  create_mock_data(
    databaseStart = "my_study",
    variables = variables,
    variable_details = variable_details,
    n = min(batch_size, total_n - (i - 1) * batch_size),
    seed = base_seed + i
  )
})
final_data <- dplyr::bind_rows(results)
4. Simplify metadata:
- Reduce number of categories per variable
- Use uniform instead of normal distributions
- Minimize garbage data (only for QA testing, not production)
Memory management
For n > 10M rows:
# Generate in batches and write to disk
for (i in 1:n_batches) {
batch <- create_mock_data(
databaseStart = "my_study",
variables = variables,
variable_details = variable_details,
n = batch_size,
seed = base_seed + i
)
# Write to disk immediately
write.csv(batch, paste0("mock_data_batch_", i, ".csv"), row.names = FALSE)
# Free memory
rm(batch)
gc()
}
# Combine later if needed
all_files <- list.files(pattern = "mock_data_batch_.*\\.csv")
combined <- do.call(rbind, lapply(all_files, read.csv))
Reference implementation
See inst/extdata/minimal-example/ for a complete working example with 11 variables (Categorical: 1, Continuous: 10) and 26 detail rows demonstrating:
- All 20 variable-level extension columns (plus 4 core recodeflow columns)
- All 3 detail-level extension columns (plus 4 core recodeflow columns)
- All variable types (integer, factor, double, date)
- All garbage data patterns (advanced low/high, simple prop_garbage)
- All distribution types (normal, uniform, gompertz)
- Survival analysis with competing risks
- UID-based foreign keys
- Interval notation throughout
- Complete grid principle (no empty strings, explicit NA/sentinels)
This example validates successfully and can be used as a template for new projects.