
About this vignette: This reference document provides the complete configuration schema specification for MockData v0.2.1. For step-by-step tutorials, see Generating datasets from configuration files.

Finding specific columns

This is a comprehensive reference with 20+ configuration columns. Use your browser’s search (Ctrl+F / Cmd+F) to quickly find specific columns by name. Jump to sections: variables.csv | variable_details.csv | Examples

Quick reference

Essential columns for getting started:

| Column | File | Required | Purpose | Example |
|---|---|---|---|---|
| uid | variables.csv | Yes | Unique variable identifier | "v001" |
| variable | variables.csv | Yes | Output column name | "age" |
| variableType | variables.csv | Yes | Categorical or Continuous (from recodeflow) | "Continuous" |
| rType | variables.csv | No | R output type (integer, double, character, date, factor) | "integer" |
| distribution | variables.csv | Continuous/Date | normal, uniform, gompertz | "normal" |
| mean, sd | variables.csv | Normal dist | Distribution parameters | 50, 15 |
| garbage_low_prop | variables.csv | No | QA testing (low values) | 0.01 |
| garbage_high_prop | variables.csv | No | QA testing (high values) | 0.03 |
| uid_detail | variable_details.csv | Yes | Unique detail identifier | "d001" |
| recStart | variable_details.csv | Yes | Category code or range | "1" or "[18,100]" |
| recEnd | variable_details.csv | Conditional | Missing data classification | "1", "NA::a", "NA::b" |
| catLabel | variable_details.csv | No | Category label | "Never smoker" |
| proportion | variable_details.csv | Categorical | Category probability (0-1) | 0.5 |

For complete column documentation, see sections below.

How this vignette is generated

This vignette uses inline R code to generate documentation directly from the actual example data: column counts, CSV examples, and dataset summaries are all calculated dynamically.

This approach ensures the documentation stays synchronized with the package and serves as integration testing during package builds.


Overview

MockData uses a two-file configuration system to define mock datasets:

  1. variables.csv - Variable-level metadata and generation parameters (24 columns total: 4 core + 20 extensions)
  2. variable_details.csv - Detail-level specifications for categories and ranges (7 columns: 4 core + 3 extensions)

This reference documents the complete v0.2.1 schema, including all extension columns, interval notation, and validation rules.


File: variables.csv

Purpose: Variable-level metadata and generation parameters

Core columns (from recodeflow)

| Column | Type | Required | Description | Example |
|---|---|---|---|---|
| uid | character | Yes | Unique identifier for this variable | "v001" |
| variable | character | Yes | Variable name (column in output) | "age" |
| label | character | No | Human-readable description | "Age in years" |
| variableType | character | Yes | From recodeflow: "Categorical" or "Continuous" | "Continuous" |

UID format: Use pattern vNNN with zero-padded numbers (e.g., v001, v002, v010)

Extension columns (MockData-specific)

Type and generation control

| Column | Type | Description | Values | Example |
|---|---|---|---|---|
| rType | character | R data type for output | "integer", "double", "character", "date", "factor" | "integer" |
| role | character | Multi-valued roles (comma-separated) | "enabled,predictor,table1" | "enabled,predictor" |
| position | integer | Generation order (use increments of 10) | 10, 20, 30 | 10 |
| seed | integer | Random seed for this variable | Any integer | 100 |

Role values:

  • enabled - Generate this variable (required for generation)
  • predictor - Use in regression models
  • outcome - Outcome variable
  • metadata - Metadata/administrative variable
  • table1 - Include in Table 1 summary

Seed pattern: Recommended: seed = position × 10 (ensures reproducibility and prevents correlation artifacts)

Garbage data (data quality testing)

| Column | Type | Description | Values | Example |
|---|---|---|---|---|
| garbage_low_prop | numeric | Proportion of low-range garbage data | 0 to 1 | 0.01 |
| garbage_low_range | character | Range for low garbage data | Interval notation | "[-5,10]" |
| garbage_high_prop | numeric | Proportion of high-range garbage data | 0 to 1 | 0.03 |
| garbage_high_range | character | Range for high garbage data | Interval notation | "[120,150]" |
| prop_garbage | numeric | Proportion of auto-generated invalid values | 0 to 1 | 0.05 |

Two garbage data modes:

  1. Advanced (precise control): Use garbage_low_prop/garbage_low_range and/or garbage_high_prop/garbage_high_range to specify exact ranges
  2. Simple (auto-generated): Use prop_garbage for automatic invalid value generation

Precedence: If garbage_low_prop or garbage_high_prop is specified, those take precedence; otherwise prop_garbage is used.

Interpretation by variable type:

  • Categorical: prop_garbage generates invalid codes (99, 999, 88, etc. not in valid categories)
  • Continuous: Advanced uses specified ranges; simple generates out-of-range values
  • Date: Advanced uses specified date ranges; simple generates dates 1-5 years before/after valid range

Examples:

# Advanced: Age with precise low garbage data (negative ages)
uid,variable,garbage_low_prop,garbage_low_range,garbage_high_prop,garbage_high_range,prop_garbage
v001,age,0.01,"[-5,10]",NA,"[,]",NA

# Advanced: BMI with two-sided garbage data (2% low + 1% high)
uid,variable,garbage_low_prop,garbage_low_range,garbage_high_prop,garbage_high_range,prop_garbage
v003,BMI,0.02,"[-10,0]",0.01,"[60,150]",NA

# Simple: Smoking with auto-generated invalid codes
uid,variable,garbage_low_prop,garbage_low_range,garbage_high_prop,garbage_high_range,prop_garbage
v002,smoking,NA,NA,NA,NA,0.05

# Simple: Death date with auto-generated out-of-period dates
uid,variable,garbage_low_prop,garbage_low_range,garbage_high_prop,garbage_high_range,prop_garbage
v006,death_date,NA,"[,]",NA,"[,]",0.02

Sentinel values: Use NA for not applicable, "[,]" for empty ranges.
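The advanced mode can be sketched in a few lines of R. `inject_garbage()` below is a hypothetical helper, not a MockData function; it mimics the advanced mode for a continuous variable so you can see what the proportions and ranges do.

```r
# Hypothetical helper sketching the "advanced" garbage mode; MockData's
# internal generator is not shown in this vignette and may differ.
inject_garbage <- function(x, low_prop = NA, low_range = NULL,
                           high_prop = NA, high_range = NULL) {
  n <- length(x)
  if (!is.na(low_prop) && low_prop > 0) {
    idx <- sample(n, size = round(n * low_prop))
    x[idx] <- runif(length(idx), low_range[1], low_range[2])
  }
  if (!is.na(high_prop) && high_prop > 0) {
    idx <- sample(n, size = round(n * high_prop))
    x[idx] <- runif(length(idx), high_range[1], high_range[2])
  }
  x
}

set.seed(100)
age <- runif(1000, 18, 100)                                  # valid values
age <- inject_garbage(age, low_prop = 0.01, low_range = c(-5, 10))
mean(age < 18)  # 0.01: exactly 1% low-range garbage
```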

Distribution parameters (continuous and date variables)

| Column | Type | Description | Values | Example |
|---|---|---|---|---|
| distribution | character | Distribution type | "normal", "uniform", "gompertz", "exponential" | "normal" |
| mean | numeric | Mean (normal distribution) | Any number | 50 |
| sd | numeric | Standard deviation (normal) | Positive number | 15 |
| rate | numeric | Rate parameter (gompertz/exponential) | Positive number | 0.0001 |
| shape | numeric | Shape parameter (gompertz) | Any number | 0.1 |

Distribution types:

For continuous variables:

  • "normal" - Normal (Gaussian) distribution (requires mean, sd)
  • "uniform" - Uniform distribution over valid range

For date variables:

  • "uniform" - Equal probability for all dates
  • "gompertz" - Age-related hazard (requires rate, shape, followup_min, followup_max, event_prop)
  • "exponential" - Constant hazard (requires rate, followup_min, followup_max, event_prop)

For categorical variables: Set to NA (categories defined by proportions in variable_details.csv)

Examples:

# Age: normal distribution
uid,variable,distribution,mean,sd,rate,shape
v001,age,normal,50,15,NA,NA

# BMI: normal distribution
uid,variable,distribution,mean,sd,rate,shape
v003,BMI,normal,27.5,5.2,NA,NA

# Interview date: uniform
uid,variable,distribution,mean,sd,rate,shape
v004,interview_date,uniform,NA,NA,NA,NA

# Primary event: gompertz survival
uid,variable,distribution,mean,sd,rate,shape
v005,primary_event_date,gompertz,NA,NA,0.0001,0.1
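As noted in the FAQ below, normal draws are restricted to the valid range from recStart. A simple resampling approach is one possible implementation (a sketch, not necessarily the package's own):

```r
# Truncated-normal sketch: redraw until enough values fall inside [lower, upper].
rnorm_truncated <- function(n, mean, sd, lower, upper) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- rnorm(n, mean, sd)
    out <- c(out, x[x >= lower & x <= upper])
  }
  out[seq_len(n)]
}

set.seed(10)
age <- rnorm_truncated(1000, mean = 50, sd = 15, lower = 18, upper = 100)
range(age)  # no values below 18 or above 100
```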

Survival parameters (date variables with events)

| Column | Type | Description | Values | Example |
|---|---|---|---|---|
| followup_min | integer | Minimum follow-up days | Positive integer | 365 |
| followup_max | integer | Maximum follow-up days | Positive integer | 5475 |
| event_prop | numeric | Proportion experiencing event | 0 to 1 | 0.1 |

When to use:

  • Date variables representing events (death_date, disease_diagnosis, etc.)
  • NOT for index dates (interview_date) - those are the time origin

Example:

# Primary event date: 10% experience dementia diagnosis within 1-15 years
uid,variable,distribution,followup_min,followup_max,event_prop
v005,primary_event_date,gompertz,365,5475,0.1

# Death date: 20% die within 1-20 years (competing risk)
uid,variable,distribution,followup_min,followup_max,event_prop
v006,death_date,gompertz,365,7300,0.2

# Loss to follow-up: 10% lost within 1-20 years (censoring)
uid,variable,distribution,followup_min,followup_max,event_prop
v007,ltfu_date,uniform,365,7300,0.1
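For context, Gompertz event times can be drawn by inverting the survival function S(t) = exp(-(rate/shape)(exp(shape·t) - 1)). This is the standard inverse-CDF method and only a sketch of what a sampler might do; MockData's internal sampler may differ, and followup_min/followup_max/event_prop would be applied afterwards as windowing and censoring.

```r
# Inverse-CDF draw from a Gompertz distribution (standard method).
# rate and shape follow the parameter table above.
rgompertz_time <- function(n, rate, shape) {
  u <- runif(n)
  log(1 - (shape / rate) * log(u)) / shape
}

set.seed(80)
t_days <- rgompertz_time(1000, rate = 1e-4, shape = 0.1)
all(t_days > 0)  # TRUE: event times are positive
```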

Versioning

| Column | Type | Description | Format | Example |
|---|---|---|---|---|
| mockDataVersion | character | Semantic version | MAJOR.MINOR.PATCH | "1.0.0" |
| mockDataLastUpdated | character | Last update date | YYYY-MM-DD | "2025-11-09" |
| mockDataVersionNotes | character | Version notes | Free text | "Initial version" |

Use cases:

  • Track changes to generation parameters over time
  • Document why garbage data values changed
  • Maintain audit trail for reproducibility

Complete example

"uid","variable","label","variableType","rType","role","position","seed","garbage_low_prop","garbage_low_range","garbage_high_prop","garbage_high_range","distribution","mean","sd","rate","shape","followup_min","followup_max","event_prop","sourceFormat","mockDataVersion","mockDataLastUpdated","mockDataVersionNotes"
"cchsflow_v0001","age","Age in years","Continuous","integer","enabled,predictor,table1",10,10,NA,"[;]",NA,"[;]","normal",50,15,NA,NA,NA,NA,NA,"","1.0.0","2025-11-09","Normal distribution (mean=50, sd=15)"
"cchsflow_v0002","smoking","Smoking status","Categorical","factor","enabled,predictor,table1",20,20,NA,"",NA,"","",NA,NA,NA,NA,NA,NA,NA,"","1.0.0","2025-11-09","Categorical variable with proportions"
"cchsflow_v0003","BMI","Body mass index","Continuous","double","enabled,outcome,table1",30,30,0.02,"[-10;15]",0.01,"[60;150]","normal",27.5,5.2,NA,NA,NA,NA,NA,"","1.0.0","2025-11-09","Normal distribution with two-sided contamination"
"cchsflow_v0004","height","Height in meters","Continuous","double","enabled,predictor",40,40,1,"[0;1.4)",0.01,"(2.1;inf]","normal",1.7,0.1,NA,NA,NA,NA,NA,"","1.0.0","2025-11-13","Height for BMI calculation"
"cchsflow_v0005","weight","Weight in kilograms","Continuous","double","enabled,predictor",50,50,NA,"[;]",NA,"[;]","normal",75,15,NA,NA,NA,NA,NA,"","1.0.0","2025-11-13","Weight for BMI calculation"
"cchsflow_v0006","BMI_derived","BMI calculated from height and weight","Continuous","double","enabled,outcome,table1",60,60,NA,"[;]",NA,"[;]","",NA,NA,NA,NA,NA,NA,NA,"","1.0.0","2025-11-13","Derived variable: BMI = weight / (height^2)"
"ices_v01","interview_date","Interview date (cohort entry)","Continuous","date","enabled,outcome",70,70,0,"[;]",0,"[;]","uniform",NA,NA,NA,NA,NA,NA,NA,"analysis","1.0.0","2025-11-09","Uniform date range (cohort entry)"
"ices_v02","primary_event_date","Primary event date (dementia diagnosis)","Continuous","date","enabled,outcome,table1",80,80,0,"[;]",0.03,"[2021-01-01;2099-12-31]","gompertz",NA,NA,1e-04,0.1,0,5475,0.3,"analysis","1.0.0","2025-11-09","Gompertz survival with temporal violations (3%)"
"ices_v03","death_date","Death date (competing risk)","Continuous","date","enabled,outcome,table1",90,90,0,"[;]",0.03,"[2025-01-01;2099-12-31]","gompertz",NA,NA,1e-04,0.1,365,7300,0.2,"analysis","1.0.0","2025-11-09","Gompertz survival with auto-generated date corruption"
"ices_v04","ltfu_date","Loss to follow-up date","Continuous","date","enabled,outcome",100,100,0,"[;]",0.03,"[2025-01-01;2099-12-31]","uniform",NA,NA,NA,NA,365,7300,0.1,"analysis","1.0.0","2025-11-09","Uniform censoring (10% occurrence)"
"ices_v05","admin_censor_date","admin_censor_date","Continuous","date","enabled,outcome",110,110,0,"[;]",0,"[;]","",NA,NA,NA,NA,365,7300,1,"analysis","1.0.0","2025-11-09",""

File: variable_details.csv

Purpose: Detail-level specifications for categories and ranges

Core columns (from recodeflow)

| Column | Type | Required | Description | Example |
|---|---|---|---|---|
| uid | character | Yes | Foreign key to variables.csv | "v001" |
| uid_detail | character | Yes | Unique identifier for this row | "d001" |
| variable | character | Yes | Must match variable in variables.csv | "age" |
| recStart | character | Yes | Input value or range | "[18,100]" or "1" |

UID relationships:

  • uid must exist in variables.csv (foreign key)
  • uid_detail must be unique across entire file
  • Pattern: dNNN with zero-padded numbers

Extension columns (MockData-specific)

| Column | Type | Description | Example |
|---|---|---|---|
| catLabel | character | Category label or description | "Valid age range" or "Never smoker" |
| proportion | numeric | Population proportion (0-1) | 0.5 |

Proportion rules:

  • Must sum to 1.0 per variable (for categorical variables)
  • Use NA for continuous/date variables with single range specification

Complete example

"uid","uid_detail","variable","recStart","recEnd","catLabel","proportion"
"cchsflow_v0001","cchsflow_d00001","age","[18,100]","copy","Valid age range",0.9
"cchsflow_v0001","cchsflow_d00002","age","997","NA::b","Don't know",0.05
"cchsflow_v0001","cchsflow_d00003","age","998","NA::b","Refusal",0.03
"cchsflow_v0001","cchsflow_d00004","age","999","NA::b","Not stated",0.02
"cchsflow_v0002","cchsflow_d00005","smoking","1","1","Never smoker",0.5
"cchsflow_v0002","cchsflow_d00006","smoking","2","2","Former smoker",0.3
"cchsflow_v0002","cchsflow_d00007","smoking","3","3","Current smoker",0.17
"cchsflow_v0002","cchsflow_d00008","smoking","7","NA::b","Don't know",0.03
"cchsflow_v0003","cchsflow_d00009","BMI","[15,50]","copy","Valid BMI range",NA
"cchsflow_v0003","cchsflow_d00010","BMI","996","NA::a","Not applicable",0.3
"cchsflow_v0003","cchsflow_d00011","BMI","[997,999]","NA::b","Don't know, refusal, not stated",0.1
"cchsflow_v0004","cchsflow_d00012","height","[1.4,2.1]","copy","Valid height range (meters)",NA
"cchsflow_v0004","cchsflow_d00013","height","else","NA::b","Missing height",0.02
"cchsflow_v0005","cchsflow_d00014","weight","[35,150]","copy","Valid weight range (kg)",NA
"cchsflow_v0005","cchsflow_d00015","weight","else","NA::b","Missing weight",0.03
"cchsflow_v0006","cchsflow_d00016","BMI_derived","DerivedVar::[height, weight]","Func::bmi_fun","BMI calculated from height and weight",NA
"ices_v01","ices_d001","interview_date","[2001-01-01,2005-12-31]","copy","Interview date range",1
"ices_v01","ices_d002","interview_date","else","NA::b","Missing interview date",0
"ices_v02","ices_d003","primary_event_date","[2002-01-01,2021-01-01]","copy","Primary event date range",0.1
"ices_v02","ices_d004","primary_event_date","else","NA::b","Missing event date",0
"ices_v03","ices_d005","death_date","[2002-01-01,2024-12-31]","copy","Death date range",0.2
"ices_v03","ices_d006","death_date","else","NA::b","Missing death date",0.05
"ices_v04","ices_d007","ltfu_date","[2002-01-01,2024-12-31]","copy","Loss to follow-up date range",0.05
"ices_v04","ices_d008","ltfu_date","else","NA::b","Missing ltfu date",NA
"ices_v05","ices_d009","admin_censor_date","2024-12-31","copy","Administrative censor date",1
"ices_v05","ices_d010","admin_censor_date","else","NA::b","Missing administrative censor date",0

Note: Distribution parameters (mean, sd, rate, shape, etc.) are specified in variables.csv, NOT in variable_details.csv.
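The one-to-many relationship between the two files can be verified with plain data frames. This self-contained sketch uses a few rows modeled on the example above:

```r
# Tiny excerpt of the two configuration files as data frames
variables <- data.frame(uid      = c("v001", "v002"),
                        variable = c("age", "smoking"))
details <- data.frame(
  uid        = c("v001", "v002", "v002", "v002"),
  uid_detail = c("d001", "d002", "d003", "d004"),
  variable   = c("age", "smoking", "smoking", "smoking")
)

stopifnot(all(details$uid %in% variables$uid))  # foreign keys resolve
stopifnot(!anyDuplicated(details$uid_detail))   # uid_detail is unique
table(details$uid)  # one row for v001, three for v002 (one-to-many)
```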


recStart syntax

The recStart column specifies input values or ranges using either single values or interval notation.

Single values (categorical variables)

For categorical variables, specify exact category codes:

uid,recStart,catLabel,proportion
v002,1,Never smoker,0.50
v002,2,Former smoker,0.30
v002,3,Current smoker,0.20

Interval notation (continuous and date variables)

For continuous or date variables, use interval notation: [min,max]

Format: [min,max] with comma delimiter

Common bracket types:

  • [a,b] - Inclusive on both ends (most common)
  • [a,b) - Inclusive start, exclusive end

Examples:

# Numeric range: age 18 to 100 (inclusive)
uid,recStart
v001,"[18,100]"

# Date range: interview dates 2001-2005
uid,recStart
v004,"[2001-01-01,2005-12-31]"

# Valid range for distribution truncation
uid,recStart
v003,"[18,40]"

Important: Always use double quotes around interval notation in CSV files to ensure comma inside brackets is not treated as column delimiter.
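A minimal parser for this notation might look as follows. It handles the comma-delimited form shown above; bracket type is captured but interpretation is left to the caller. This is illustrative, not MockData's own parser.

```r
# Parse "[min,max]" / "[min,max)" style interval notation into its parts.
parse_interval <- function(x) {
  m <- regmatches(x, regexec("^([[(])([^,]+),([^])]+)([])])$", x))[[1]]
  if (length(m) == 0) stop("not interval notation: ", x)
  list(lower = m[3], upper = m[4],
       lower_inclusive = m[2] == "[",
       upper_inclusive = m[5] == "]")
}

parse_interval("[18,100]")$upper                 # "100"
parse_interval("[2001-01-01,2005-12-31]")$lower  # "2001-01-01"
```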


recEnd for missing data classification

The recEnd column classifies codes into missing vs. valid categories, enabling automatic missing data generation.

Purpose: Distinguishes between:

  • Valid response codes (1, 2, 3)
  • Skip codes (6, 96, 996 - question not applicable)
  • Missing codes (7-9, 97-99 - don’t know, refusal, not stated)

Conditional requirement: Required when recStart contains missing data codes (6-9, 96-99) to enable proper classification.

Valid response codes

Map input codes to themselves using numeric values:

uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v002,d_002,smoking,1,1,Never smoker,0.50
v002,d_003,smoking,2,2,Former smoker,0.30
v002,d_004,smoking,3,3,Current smoker,0.17

Pattern: recStart="1" → recEnd="1" (code maps to itself)

Skip codes: NA::a

For valid skip/not applicable codes (question not asked due to logic):

uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v002,d_005,smoking,6,NA::a,Valid skip,0.01

Statistical treatment: Exclude from denominator (respondent was not eligible for question)

Common codes: 6, 96, 996 (varies by survey)

Missing codes: NA::b

For don’t know, refusal, not stated (question asked but no valid response):

uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v002,d_006,smoking,7,NA::b,Don't know,0.02
v002,d_007,smoking,9,NA::b,Not stated,0.01

Statistical treatment: Include in denominator when calculating response rates, exclude from numerator

Common codes: 7 (don’t know), 8 (refusal), 9 (not stated), 97, 98, 99

Range notation: Can use recStart="[7,9]" with recEnd="NA::b" to map multiple codes at once

Continuous and date variables

Use "copy" for valid ranges:

uid,uid_detail,variable,recStart,recEnd,catLabel
v001,d001,age,"[18,100]",copy,Valid age range
v004,d_007,interview_date,"[2001-01-01,2005-12-31]",copy,Interview date range

Pattern: recEnd=“copy” indicates the range should be used as-is for generation

Complete example with missing data

uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v002,d_002,smoking,1,1,Daily smoker,0.25
v002,d_003,smoking,2,2,Occasional,0.15
v002,d_004,smoking,3,3,Never,0.57
v002,d_005,smoking,7,NA::b,Don't know,0.03
# Sum: 1.00 ✓

Result: When generating data, get_variable_categories(include_na=TRUE) returns only code “7”, while get_variable_categories(include_na=FALSE) returns codes “1”, “2”, “3”.

Why recEnd is required

Without recEnd: Cannot distinguish between:

  • Code 1 (valid response)
  • Code 7 (missing - don’t know)

With recEnd: Explicit classification enables:

  • Automatic prop_NA parameter handling
  • Correct missing vs. valid proportions
  • Statistical analysis (response rates, prevalence)

Validation: If recStart contains codes 6-9 or 96-99 and recEnd column is missing, validation will error with instructions.


Proportions

Basic rules

  1. Must sum to 1.0 per variable (for categorical distributions)
  2. Garbage data is separate - garbage_*_prop and prop_garbage are not counted in the proportion sum
  3. Event proportions - event_prop in variables.csv may be < 1.0 (represents censoring)

Categorical variables

Proportions define population distribution:

uid,recStart,catLabel,proportion
v002,1,Never,0.50
v002,2,Former,0.30
v002,3,Current,0.17
v002,7,Missing,0.03
# Sum: 1.00 ✓

Continuous and date variables

Use NA for proportion:

uid,recStart,catLabel,proportion
v001,"[18,100]","Valid age range",NA
v004,"[2001-01-01,2005-12-31]","Interview date range",NA

With missing codes (categorical)

Missing codes are part of population (must sum to 1.0):

uid,recStart,catLabel,proportion
v002,1,Never smoker,0.50
v002,2,Former smoker,0.30
v002,3,Current smoker,0.17
v002,7,Don't know,0.03
# Sum: 1.00 ✓
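The sum-to-1.0 rule for categorical variables can be checked with a few lines of base R. This sketch skips variables whose rows are all NA (continuous/date ranges); mixed cases, such as a continuous variable with coded missing proportions, need package-specific handling.

```r
# Report variables whose categorical proportions do not sum to 1 (within tol).
check_proportions <- function(details, tol = 0.01) {
  sums <- tapply(details$proportion, details$variable, sum, na.rm = TRUE)
  ok <- abs(sums - 1) <= tol | sums == 0  # sums == 0: all-NA (continuous) rows
  sums[!ok]
}

details <- data.frame(
  variable   = rep("smoking", 4),
  proportion = c(0.50, 0.30, 0.17, 0.03)
)
check_proportions(details)  # empty result: smoking sums to 1.00
```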

Validation rules

MockData validates configuration files on load:

UID validation

  1. Pattern: uid should follow format vNNN (e.g., v001, v042), uid_detail should follow dNNN (e.g., d001, d156)
  2. Uniqueness: All uid values unique in variables.csv
  3. Uniqueness: All uid_detail values unique in variable_details.csv
  4. Foreign keys: All uid in variable_details exist in variables.csv

Column validation

  1. Required columns present:
    • variables.csv: uid, variable, variableType
    • variable_details.csv: uid, uid_detail, variable, recStart
  2. Variable name match: variable_details.variable matches variables.variable
  3. Column placement: Extension columns only in correct files (e.g., distribution params in variables.csv)

Value validation

  1. Proportions: Sum to 1.0 per variable (±0.01 tolerance)
  2. Garbage data proportions: Between 0 and 1
  3. Garbage data ranges: Use interval notation [min,max] or sentinel "[,]"
  4. rType values: One of: integer, double, factor, date, character, logical
  5. distribution values: One of: normal, uniform, gompertz, exponential, or NA
  6. Versioning: mockDataVersion follows semantic versioning (if present)
  7. Complete grid: All cells have explicit values (no empty strings for optional columns - use NA or sentinel values)
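The UID rules above can be expressed as a handful of assertions. This is a sketch against plain data frames, not the package's validator, and it checks the bare vNNN/dNNN patterns; project-prefixed UIDs such as cchs_v001 would need a looser pattern.

```r
# Assert the UID validation rules: pattern, uniqueness, foreign keys.
validate_uids <- function(variables, details) {
  stopifnot(all(grepl("^v[0-9]{3}$", variables$uid)))       # vNNN pattern
  stopifnot(all(grepl("^d[0-9]{3}$", details$uid_detail)))  # dNNN pattern
  stopifnot(!anyDuplicated(variables$uid))                  # unique uid
  stopifnot(!anyDuplicated(details$uid_detail))             # unique uid_detail
  stopifnot(all(details$uid %in% variables$uid))            # foreign keys
  invisible(TRUE)
}

variables <- data.frame(uid = c("v001", "v002"))
details   <- data.frame(uid = "v001", uid_detail = "d001")
validate_uids(variables, details)  # passes silently
```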

Detailed column reference

uid and uid_detail: Unique identifiers

Purpose: Permanent identifiers for traceability across metadata versions

Format requirements:

  • uid: Pattern vNNN with zero-padding (e.g., v001, v042)
  • uid_detail: Pattern dNNN with zero-padding (e.g., d001, d156)
  • Zero-padding recommended for sorting: v001 not v1

Common errors:

# ❌ WRONG: Inconsistent padding
uid,variable
v1,age
v10,smoking
v2,BMI
# Result: Sorts incorrectly (v1, v10, v2)

# ✅ CORRECT: Zero-padded
uid,variable
v001,age
v002,smoking
v010,BMI
# Result: Sorts correctly (v001, v002, v010)

Edge cases:

  • UIDs must be globally unique across all databases
  • Reusing UIDs across databases for “same variable” is acceptable but requires consistent definitions
  • Changing UID means “new variable” even if variable name unchanged

Best practices:

  • Start numbering at v001 (not v000)
  • Use sequential numbers for easy tracking
  • Document UID assignment logic in project README
  • Use project prefixes for multi-project repos: cchs_v001, chms_v001

role: Multi-valued variable roles

Purpose: Tag variables for different purposes in analysis

Valid values (comma-separated):

| Value | Meaning | Use case |
|---|---|---|
| enabled | Required for generation | Must be present to generate variable |
| predictor | Independent/explanatory variable | Regression models, Table 1 |
| outcome | Dependent/response variable | Primary/secondary outcomes |
| metadata | Administrative/tracking | IDs, dates, survey metadata |
| table1 | Summary table variable | Baseline characteristics |

Examples:

# Age: predictor for models, show in Table 1
uid,variable,role
v001,age,"enabled,predictor,table1"

# Primary outcome: show in Table 1
uid,variable,role
v005,dementia_diagnosis,"enabled,outcome,table1"

# Study ID: just metadata
uid,variable,role
v020,study_id,"enabled,metadata"

# Derived variable: not enabled (calculated post-generation)
uid,variable,role
v006,BMI_derived,outcome

Important: Variable will NOT be generated unless role contains "enabled".

Common errors:

# ❌ WRONG: Forgot "enabled"
uid,variable,role
v001,age,"predictor,table1"
# Result: age will NOT be generated

# ✅ CORRECT
uid,variable,role
v001,age,"enabled,predictor,table1"

Multi-role filtering example:

# Get only variables for Table 1
table1_vars <- variables[grepl("table1", variables$role), ]

# Get predictors
predictors <- variables[grepl("predictor", variables$role), ]

# Get enabled variables for generation
enabled_vars <- variables[grepl("enabled", variables$role), ]

position and seed: Generation order and reproducibility

Purpose: Control variable generation order and ensure reproducible independence

position:

  • Generation order (ascending)
  • Use increments of 10 (10, 20, 30…) for easy insertion
  • Lower numbers generated first

seed:

  • Random seed for this specific variable
  • Recommended: seed = position × 10 (e.g., position 20 → seed 200)
  • Prevents correlation artifacts between variables

Why position matters:

Some variables may depend on others being generated first (though MockData generally handles this automatically).

Why seed matters:

Using the same seed for all variables can create artificial correlations. Different seeds ensure statistical independence.

Examples:

# Good: Increments of 10, seed = position × 10
uid,variable,position,seed
v001,age,10,100
v002,smoking,20,200
v003,BMI,30,300

Common errors:

# ❌ WRONG: Same seed for all variables
uid,variable,position,seed
v001,age,10,123
v002,smoking,20,123
v003,BMI,30,123
# Result: Variables may be artificially correlated

# ❌ WRONG: Sequential positions (hard to insert)
uid,variable,position,seed
v001,age,1,10
v002,smoking,2,20
v003,BMI,3,30
# Result: Hard to insert new variable between age and smoking

Inserting variables:

# Original
uid,variable,position
v001,age,10
v003,BMI,30

# Insert smoking between age and BMI
uid,variable,position
v001,age,10
v002,smoking,20  # Fits perfectly
v003,BMI,30
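The position/seed pattern translates into a simple generation loop: sort by position, then re-seed before each variable. The `rnorm(5)` call below is a placeholder for the variable-specific generator, not MockData's actual code.

```r
# Per-variable seeding: changing one variable's parameters does not
# disturb the values generated for the others.
variables <- data.frame(variable = c("age", "smoking", "BMI"),
                        position = c(10, 20, 30),
                        seed     = c(100, 200, 300))
variables <- variables[order(variables$position), ]

out <- lapply(seq_len(nrow(variables)), function(i) {
  set.seed(variables$seed[i])
  rnorm(5)  # placeholder for the real generator
})
names(out) <- variables$variable

# Reproducibility: re-seeding with the same value recreates "age" exactly
set.seed(100)
identical(out$age, rnorm(5))  # TRUE
```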

sourceFormat vs sourceData

Purpose: Simulate raw data formats for harmonization pipeline testing

Column name: The minimal example uses sourceFormat; some documentation refers to the same column as sourceData.

Valid values:

| Value | Output type | Simulates | Example |
|---|---|---|---|
| analysis | R Date object | Analysis-ready dates | as.Date("2001-01-01") |
| csv | Character string | CSV file dates | "2001-01-01" |
| sas | Numeric | SAS date numeric | 14975 (days since 1960-01-01) |
Use case: Test date parsing/harmonization logic

Examples:

# Analysis-ready (default)
uid,variable,sourceFormat
v004,interview_date,analysis

# CSV import simulation
uid,variable,sourceFormat
v004,interview_date,csv

# SAS import simulation
uid,variable,sourceFormat
v004,interview_date,sas

Conversion examples:

# CSV to Date
dates_csv <- "2001-01-01"
as.Date(dates_csv)

# SAS to Date
dates_sas <- 14975
as.Date(dates_sas, origin = "1960-01-01")

See Working with date variables for detailed examples.


distribution: Distribution types by variable type

Purpose: Specify how values are distributed across valid ranges

For categorical variables: Use NA (categories defined by proportions in variable_details.csv)

For continuous variables:

| Distribution | Parameters required | Use case | Example |
|---|---|---|---|
| normal | mean, sd | Age, BMI, normally-distributed measurements | Age: mean=50, sd=15 |
| uniform | None (uses recStart range) | Equal probability across range | Income brackets, uniform codes |

For date variables:

| Distribution | Parameters required | Use case | Example |
|---|---|---|---|
| uniform | None | Index dates, enrollment dates | Interview date |
| gompertz | rate, shape, followup_min, followup_max, event_prop | Age-related events (death, dementia) | Mortality with increasing hazard by age |
| exponential | rate, followup_min, followup_max, event_prop | Constant hazard events | Loss to follow-up |

Complete examples:

# Continuous: normal distribution
uid,variable,variableType,distribution,mean,sd
v001,age,Continuous,normal,50,15

# Continuous: uniform distribution
uid,variable,variableType,distribution
v010,income_bracket,Continuous,uniform

# Categorical: use NA (proportions in variable_details.csv)
uid,variable,variableType,distribution
v002,smoking,Categorical,

# Date: uniform (index date)
uid,variable,variableType,distribution
v004,interview_date,Date,uniform

# Date: gompertz survival (event date)
uid,variable,variableType,distribution,rate,shape,followup_min,followup_max,event_prop
v005,death_date,Date,gompertz,0.0001,0.1,365,7300,0.2

Common errors:

# ❌ WRONG: Normal distribution without mean/sd
uid,variable,distribution,mean,sd
v001,age,normal,,

# ❌ WRONG: Distribution for categorical variable
uid,variable,variableType,distribution
v002,smoking,Categorical,normal

# ✅ CORRECT
uid,variable,distribution,mean,sd
v001,age,normal,50,15
v002,smoking,,

Parameter interpretation:

  • Normal: Values drawn from N(mean, sd²), truncated to recStart range
  • Gompertz: Hazard increases exponentially with age (realistic mortality)
  • Exponential: Constant hazard over time (random censoring)

Distribution comparison table

| Distribution | Shape | Median location | Skewness | Best for |
|---|---|---|---|---|
| normal | Bell curve | At mean | Symmetric | Age, BMI, height, normally-distributed measurements |
| uniform | Flat | Middle of range | Symmetric | Dates without temporal pattern, random assignment |
| gompertz | Right-skewed | Shifted toward end | Positive | Mortality, age-related disease onset |
| exponential | Right-skewed | Shifted toward start | Positive | Time to first event, loss to follow-up |

Visual interpretation (for 2001-2020 date range):

  • uniform: Median ≈ 2010 (middle)
  • gompertz: Median ≈ 2015-2018 (later years, increasing hazard)
  • exponential: Median ≈ 2003-2007 (earlier years, constant hazard)

Frequently asked questions

General configuration

Q: What’s the difference between variables.csv and variable_details.csv?

A: variables.csv defines variable-level metadata (one row per variable). variable_details.csv defines detail-level specifications (multiple rows per variable for categories/ranges). Think of it as a one-to-many relationship.

Q: Can I use a single CSV file instead of two?

A: No. The two-file structure comes from recodeflow and is required. It enables:

  • Multiple category definitions per variable
  • Clean separation of variable-level vs category-level parameters
  • Standardized harmonization workflow integration

Q: Do I need to fill in all columns?

A: No. Only required columns need values. Optional columns should use:

  • NA for not applicable
  • Sentinel values like "[,]" for empty ranges
  • Avoid empty strings; validation expects explicit NA or sentinel values

Q: How do I know which columns are required?

A: See Quick reference table. Core required columns:

  • variables.csv: uid, variable, variableType
  • variable_details.csv: uid, uid_detail, variable, recStart

Variable types

Q: When should I use Categorical vs Continuous for numeric codes?

A:

  • Categorical: Discrete codes with specific meanings (smoking status: 1=never, 2=former, 3=current)
  • Continuous: Numeric measurements on a scale (age, BMI, income)

Rule of thumb: If you would calculate a mean, use Continuous. If you would calculate proportions, use Categorical.

Q: Can I have a Continuous variable with only integer values?

A: Yes. Set variableType = "Continuous" and rType = "integer". Example: Age in years.

Q: What’s the difference between variableType and rType?

A:

  • variableType: Conceptual type from recodeflow (Categorical or Continuous) - determines generation logic
  • rType: R output data type specified by MockData (integer, double, character, date, factor) - determines output format

Note: Date variables use variableType = "Continuous" and rType = "date". The variableType field comes from recodeflow and uses only Categorical/Continuous values.

Q: Should smoking status be factor or character?

A: Use rType = "factor" for categorical variables. Factors preserve category levels and enable proper statistical analysis.


Garbage data

Q: What’s the difference between garbage_low/garbage_high and prop_garbage?

A: Two modes:

  1. Advanced (garbage_low/garbage_high): You specify exact invalid ranges
    • Example: Age with garbage_low_range = "[-5,10]" generates negative ages
  2. Simple (prop_garbage): MockData auto-generates invalid values
    • Example: Categorical with prop_garbage = 0.05 gets invalid codes like 99, 999

Precedence: If you specify garbage_low_prop OR garbage_high_prop, those take precedence. Otherwise prop_garbage is used.

Q: Why would I want garbage data?

A: To test data quality pipelines:

  • Validate your data cleaning code catches impossible values
  • Train analysts on real-world data quality issues
  • Test edge cases in harmonization logic

Q: Can I have both low and high garbage data?

A: Yes. Specify both garbage_low_prop/garbage_low_range AND garbage_high_prop/garbage_high_range:

uid,variable,garbage_low_prop,garbage_low_range,garbage_high_prop,garbage_high_range
v003,BMI,0.02,"[-10,0]",0.01,"[60,150]"

Result: 2% have BMI < 0, 1% have BMI 60-150, rest have valid values.


Missing data

Q: What’s the difference between NA::a and NA::b?

A: Survey missing data classification:

  • NA::a (valid skip): Question not asked due to logic (respondent not eligible)
    • Example: Pregnancy questions skipped for males
    • Statistical treatment: Exclude from denominator
  • NA::b (missing response): Question asked but no valid answer (don’t know, refusal, not stated)
    • Example: Respondent refused to answer income question
    • Statistical treatment: Include in denominator, exclude from numerator
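In practice the two treatments lead to different denominators. A base R sketch with assumed counts (hypothetical numbers, for illustration only):

```r
n_yes  <- 50    # substantive "yes"
n_no   <- 770   # substantive "no"
n_na_b <- 30    # asked but no valid answer (refusal, don't know)
n_na_a <- 150   # valid skip: question never asked

# NA::a excluded from the denominator; NA::b kept in it
denominator   <- n_yes + n_no + n_na_b        # 850 eligible respondents
prevalence    <- n_yes / denominator          # ~0.059
response_rate <- (n_yes + n_no) / denominator # ~0.965
```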

Q: Do missing codes count toward the proportion sum?

A: Yes. All proportions must sum to 1.0, including missing codes:

uid,recStart,recEnd,proportion
v002,1,1,0.50
v002,2,2,0.30
v002,3,3,0.17
v002,7,NA::b,0.03
# Sum = 1.00 ✓

Q: How do I generate missing data for continuous/date variables?

A: Use else in recStart with NA classification:

uid,variable,recStart,recEnd,proportion
v001,age,"[18,100]","copy",
v001,age,"else","NA::b",0.05

Result: 5% of age values will be NA (recoded from “else” = everything not in range).

Q: Can I use ranges for missing codes?

A: Yes:

# Map codes 997-999 to NA::b
uid,recStart,recEnd
v001,"[997,999]",NA::b

Distributions and parameters

Q: What happens if normal distribution generates values outside recStart range?

A: Values are truncated to the recStart range. Example:

uid,variable,distribution,mean,sd,recStart
v001,age,normal,50,15,"[18,100]"

Result: Normal distribution N(50, 15²) truncated to [18, 100]. No values < 18 or > 100.
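One standard way to draw from a truncated normal is the inverse-CDF method, shown here as an illustration (MockData's internal sampler may differ):

```r
rtruncnorm_sketch <- function(n, mean, sd, lo, hi) {
  # Map uniforms into the CDF mass between lo and hi, then invert
  qnorm(runif(n, pnorm(lo, mean, sd), pnorm(hi, mean, sd)), mean, sd)
}

set.seed(100)
age <- rtruncnorm_sketch(1000, mean = 50, sd = 15, lo = 18, hi = 100)
range(age)  # always within [18, 100]
```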

Q: When should I use gompertz vs exponential for survival data?

A:

  • Gompertz: Age-related events where hazard increases with time (mortality, dementia, chronic disease)
  • Exponential: Constant hazard events where risk doesn’t change with time (loss to follow-up, random censoring)

Q: What are typical gompertz parameters?

A: For mortality in elderly cohorts:

distribution,rate,shape,followup_min,followup_max,event_prop
gompertz,0.0001,0.1,365,7300,0.2

  • rate = 0.0001: Baseline hazard
  • shape = 0.1: Hazard acceleration (positive = increasing hazard with age)
  • followup_max = 7300: 20 years
  • event_prop = 0.2: 20% experience event
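These parameters map onto the Gompertz hazard h(t) = rate × exp(shape × t). Event times can be drawn by inverting the survival function, sketched below (an illustration of the model, not the package's actual sampler):

```r
# S(t) = exp(-(rate/shape) * (exp(shape * t) - 1))
# Solving S(T) = U for uniform U gives:
rgompertz_sketch <- function(n, rate, shape) {
  u <- runif(n)
  (1 / shape) * log(1 - (shape / rate) * log(u))
}

set.seed(500)
t_event <- rgompertz_sketch(1000, rate = 0.0001, shape = 0.1)
summary(t_event)  # positive shape => hazard rises with time
```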

Q: Why do my date variables all have the same value?

A: Check:

  1. Are you using distribution = "uniform"? (Required for variation)
  2. Is recStart an interval [start,end] not a single date?
  3. Did you set different seeds for each variable?

UIDs and foreign keys

Q: Must uid in variable_details.csv match uid in variables.csv exactly?

A: Yes. This is a foreign key relationship. Every uid in variable_details.csv must exist in variables.csv.

Q: Can I have multiple variables with the same uid?

A: No. UIDs must be unique within variables.csv. Use different UIDs for different variables.

Q: Can variable_details.csv have gaps in uid_detail numbering?

A: Yes. uid_detail values don’t need to be sequential, just unique:

uid_detail,variable
d_001,age
d_003,age
d_042,age
# Valid - gaps are OK

Q: What happens if I reference a uid that doesn’t exist in variables.csv?

A: Validation error:

Error: uid 'v_999' in variable_details.csv not found in variables.csv

Proportions and validation

Q: How strict is the “proportions must sum to 1.0” rule?

A: Tolerance ±0.01. These are valid:

  • 1.00 ✓
  • 0.99 ✓
  • 1.01 ✓
  • 0.98 ❌ (too far from 1.0)
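A quick base R check of every variable's proportion sum against the tolerance (assuming variable_details is already loaded as a data frame):

```r
sums <- tapply(variable_details$proportion, variable_details$uid,
               sum, na.rm = TRUE)
bad  <- sums[abs(sums - 1) > 0.01]
bad  # named vector of offending uids; length 0 means all pass
```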

Q: Do garbage data proportions count in the sum?

A: No. Garbage data (garbage_low_prop, garbage_high_prop, prop_garbage) is separate from population proportions.

Example:

# Proportions in variable_details.csv must sum to 1.0
uid,recStart,proportion
v002,1,0.50
v002,2,0.30
v002,3,0.20
# Sum = 1.00 ✓

# Garbage data in variables.csv is additional
uid,prop_garbage
v002,0.05
# Result: 95% valid (distributed 50/30/20), 5% invalid codes

Q: What if I want unequal category probabilities?

A: Specify exact proportions in variable_details.csv:

uid,recStart,catLabel,proportion
v002,1,Never smoker,0.60
v002,2,Former smoker,0.25
v002,3,Current smoker,0.15
# Reflects real population distribution

Database filtering and multi-cycle data

Q: What is databaseStart and when do I need it?

A: Database/cycle identifier for filtering. Needed when:

  • Generating data for multiple survey cycles (CCHS 2001, 2005, 2009)
  • Variables have database-specific category codes
  • Testing multi-cycle harmonization

Q: How do I specify which databases a row applies to?

A: Use comma-separated list in databaseStart column (variable_details.csv):

uid,variable,databaseStart,recStart
v001,age,"cchs2001_p,cchs2005_p","[18,100]"
v002,smoking,cchs2001_p,1
v002,smoking,cchs2005_p,01

Row 1 applies to both databases. Rows 2-3 are database-specific (different codes for smoking).

Q: What if I’m only generating data for one database?

A: Use a single database name consistently:

uid,variable,databaseStart,recStart
v001,age,my_study,"[18,100]"
v002,smoking,my_study,1

Then generate with:

create_mock_data(
  databaseStart = "my_study",
  variables = variables,
  variable_details = variable_details
)

Troubleshooting

Validation errors

Error: “Proportions for variable ‘smoking’ sum to 0.97, expected 1.0”

Cause: Category proportions don’t sum to 1.0 (±0.01 tolerance)

Fix: Check proportions in variable_details.csv for that variable:

# Identify the problem
var_details %>%
  filter(variable == "smoking") %>%
  summarize(total = sum(proportion, na.rm = TRUE))

# Fix: Adjust proportions to sum to 1.0

Error: “uid ‘v_042’ in variable_details.csv not found in variables.csv”

Cause: Foreign key violation - referenced uid doesn’t exist

Fix: Check for typos or missing rows:

# Find orphaned uids
details_uids <- unique(variable_details$uid)
vars_uids <- unique(variables$uid)
orphans <- setdiff(details_uids, vars_uids)
print(orphans)

# Fix: Either add missing uid to variables.csv or fix typo in variable_details.csv

Error: “Required column ‘recStart’ not found in variable_details.csv”

Cause: Missing required column

Fix: Add the missing column:

# Check what columns exist
names(variable_details)

# Add missing column (with default values if needed)
variable_details$recStart <- NA

Error: “Invalid rType value ‘float’ for variable ‘BMI’”

Cause: rType must be one of: integer, double, factor, date, character, logical

Fix: Use "double" not "float":

# ❌ WRONG
uid,variable,rType
v003,BMI,float

# ✅ CORRECT
uid,variable,rType
v003,BMI,double

Error: “distribution ‘normal’ requires ‘mean’ and ‘sd’ parameters”

Cause: Normal distribution missing required parameters

Fix: Specify mean and sd:

# ❌ WRONG
uid,variable,distribution,mean,sd
v001,age,normal,,

# ✅ CORRECT
uid,variable,distribution,mean,sd
v001,age,normal,50,15

Generation issues

Problem: All date variables have the same value

Possible causes:

  1. Used single date instead of interval in recStart
  2. Forgot to specify distribution
  3. Same seed for all variables

Fix:

# Check recStart uses interval notation
uid,variable,recStart
v004,interview_date,"[2001-01-01,2005-12-31]"

# Check distribution is specified
uid,variable,distribution
v004,interview_date,uniform

# Check different seeds
uid,variable,seed
v004,interview_date,400
v005,event_date,500

Problem: No variables generated (empty data frame)

Possible causes:

  1. No variables have role = "enabled"
  2. databaseStart filter excludes all rows
  3. All variables are derived variables

Fix:

# Check which variables are enabled
enabled <- variables[grepl("enabled", variables$role), ]
nrow(enabled)  # Should be > 0

# Check databaseStart filtering (substitute the value you pass to
# create_mock_data(), e.g. "my_study")
filtered <- variable_details[
  grepl("my_study", variable_details$databaseStart),
]
nrow(filtered)  # Should be > 0

Problem: Proportions don’t match expected distribution

Cause: Forgot to account for missing data proportion

Fix: Missing data proportion reduces valid category proportions:

# If you want 50% never smokers in VALID responses:
uid,recStart,recEnd,proportion
v002,1,1,0.475     # 50% of 95% = 47.5%
v002,2,2,0.285     # 30% of 95% = 28.5%
v002,3,3,0.19      # 20% of 95% = 19%
v002,7,NA::b,0.05  # 5% missing
# Sum = 1.00
# Result: Among VALID responses, 50/30/20 split
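The rescaling in the comments above can be computed directly (plain R, using the proportions from this example):

```r
valid_target <- c(never = 0.50, former = 0.30, current = 0.20)
missing_prop <- 0.05

# Shrink valid-response targets so everything sums to 1 with missing added
adjusted <- valid_target * (1 - missing_prop)
adjusted                      # 0.475 0.285 0.190
sum(adjusted) + missing_prop  # 1
```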

Problem: Garbage data not appearing

Possible causes:

  1. Proportion too small to see in sample
  2. Wrong column name (garbage_low vs corrupt_low)
  3. Precedence issue (prop_garbage ignored if garbage_low_prop specified)

Fix:

# Increase proportion for testing
uid,variable,garbage_high_prop
v003,BMI,0.20  # 20% easier to verify than 1%

# Check correct column names (garbage_ not corrupt_)
uid,variable,garbage_low_prop,garbage_low_range
v001,age,0.05,"[-5,10]"

# If using advanced mode, don't specify prop_garbage
uid,variable,garbage_low_prop,garbage_low_range,prop_garbage
v001,age,0.05,"[-5,10]",  # Leave prop_garbage empty

Performance issues

Problem: Generation very slow for large n

Cause: Complex distributions (especially normal) are slower than uniform

Solutions:

  1. Use uniform distribution where appropriate
  2. Generate in batches
  3. Simplify metadata (fewer variables, fewer categories)
# Batch generation example
batch_size <- 100000
n_batches <- 10

results <- lapply(1:n_batches, function(i) {
  create_mock_data(
    databaseStart = "my_study",
    variables = variables,
    variable_details = variable_details,
    n = batch_size,
    seed = 1000 + i  # Different seed per batch
  )
})

final_data <- bind_rows(results)

Complete examples

Example 1: Basic categorical variable with missing codes

Use case: Smoking status with standard survey missing codes

# variables.csv
uid,variable,label,variableType,rType,role,position,seed,prop_garbage
v002,smoking,Smoking status,Categorical,factor,"enabled,predictor",20,200,0.05

# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v002,d_002,smoking,1,1,Never smoker,0.50
v002,d_003,smoking,2,2,Former smoker,0.30
v002,d_004,smoking,3,3,Current smoker,0.17
v002,d_005,smoking,7,NA::b,Don't know,0.03

Result: 50% never/30% former/17% current smokers, 3% don’t know (NA), plus 5% invalid codes (99, 999) from prop_garbage.


Example 2: Continuous variable with normal distribution and precise garbage ranges

Use case: Body mass index with biologically impossible values for QA testing

# variables.csv
uid,variable,label,variableType,rType,role,position,seed,garbage_low_prop,garbage_low_range,garbage_high_prop,garbage_high_range,distribution,mean,sd
v003,BMI,Body mass index,Continuous,double,"enabled,outcome",30,300,0.02,"[-10,0]",0.01,"[60,150]",normal,27.5,5.2

# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v003,d_006,BMI,"[15,50]",copy,Valid BMI range,

Result: Normal distribution N(27.5, 5.2²) truncated to [15, 50], with 2% negative BMI values and 1% extremely high (60-150) values.


Example 3: Continuous variable with missing data

Use case: Age with survey missing codes and uniform distribution

# variables.csv
uid,variable,label,variableType,rType,role,position,seed
v001,age,Age in years,Continuous,integer,"enabled,predictor,table1",10,100

# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v001,d_001,age,"[18,100]",copy,Valid age range,0.90
v001,d_002,age,997,NA::b,Don't know,0.05
v001,d_003,age,998,NA::b,Refusal,0.03
v001,d_004,age,999,NA::b,Not stated,0.02

Result: 90% valid ages uniformly distributed 18-100, 10% missing with codes 997/998/999 (mapped to NA).


Example 4: Date variable (cohort entry/index date)

Use case: Interview date as time origin for survival analysis

# variables.csv
uid,variable,label,variableType,rType,role,position,seed,distribution,sourceFormat
v004,interview_date,Interview date,Continuous,date,"enabled,metadata",40,400,uniform,analysis

# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v004,d_007,interview_date,"[2001-01-01,2005-12-31]",copy,Interview date range,1
v004,d_008,interview_date,else,NA::b,Missing interview date,0

Result: Uniform distribution of dates 2001-2005 (all respondents interviewed), output as R Date objects.


Example 5: Survival variable with competing risks (gompertz distribution)

Use case: Primary event (dementia diagnosis) with age-related hazard and temporal garbage data

# variables.csv
uid,variable,label,variableType,rType,role,position,seed,garbage_high_prop,garbage_high_range,distribution,rate,shape,followup_min,followup_max,event_prop
v005,primary_event_date,Primary event date (dementia),Continuous,date,"enabled,outcome,table1",50,500,0.03,"[2021-01-01,2099-12-31]",gompertz,0.0001,0.1,0,5475,0.1

# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v005,d_009,primary_event_date,"[2002-01-01,2021-01-01]",copy,Primary event date range,0.1
v005,d_010,primary_event_date,else,NA::b,Missing event date,0

Interpretation:

  • 10% experience dementia diagnosis within 0-15 years after interview
  • Event times follow gompertz distribution (increasing hazard with age)
  • 3% have impossible future dates (2021-2099) for QA testing
  • 90% censored (no event)

Example 6: Multi-cycle categorical variable with database-specific codes

Use case: Smoking status with different category codes across survey cycles

# variables.csv
uid,variable,label,variableType,rType,role,position,seed
v002,smoking,Smoking status,Categorical,factor,"enabled,predictor",20,200

# variable_details.csv
uid,uid_detail,variable,databaseStart,recStart,recEnd,catLabel,proportion
v002,d_002,smoking,cchs2001_p,1,1,Never smoker,0.50
v002,d_003,smoking,cchs2001_p,2,2,Former smoker,0.30
v002,d_004,smoking,cchs2001_p,3,3,Current smoker,0.20
v002,d_005,smoking,cchs2005_p,01,1,Never smoker,0.50
v002,d_006,smoking,cchs2005_p,02,2,Former smoker,0.30
v002,d_007,smoking,cchs2005_p,03,3,Current smoker,0.20

Result:

  • CCHS 2001: Codes 1/2/3 (numeric)
  • CCHS 2005: Codes 01/02/03 (zero-padded strings)
  • Same proportions, different source codes
  • Both map to harmonized values 1/2/3 (recEnd column)

Example 7: Derived variable (BMI from height and weight)

Use case: BMI calculated post-generation from height and weight

# variables.csv (height)
uid,variable,label,variableType,rType,role,position,seed,distribution,mean,sd
v004,height,Height in meters,Continuous,double,"enabled,predictor",40,400,normal,1.7,0.1

# variables.csv (weight)
uid,variable,label,variableType,rType,role,position,seed,distribution,mean,sd
v005,weight,Weight in kg,Continuous,double,"enabled,predictor",50,500,normal,75,15

# variables.csv (BMI_derived - NOT enabled)
uid,variable,label,variableType,rType,role,position,seed
v006,BMI_derived,BMI calculated from height and weight,Continuous,double,outcome,60,600

# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel
v004,d_012,height,"[1.4,2.1]",copy,Valid height range (meters)
v005,d_014,weight,"[35,150]",copy,Valid weight range (kg)
v006,d_016,BMI_derived,"DerivedVar::[height, weight]","Func::bmi_fun",BMI calculated from height and weight

Generation workflow:

# 1. Generate height and weight
mock_data <- create_mock_data(
  databaseStart = "my_study",
  variables = variables,
  variable_details = variable_details,
  n = 1000
)
# Result: Contains height, weight (but NOT BMI_derived)

# 2. Calculate BMI_derived post-generation
bmi_fun <- function(height, weight) {
  ifelse(
    is.na(height) | is.na(weight) | height <= 0,
    NA_real_,
    weight / (height^2)
  )
}

mock_data$BMI_derived <- bmi_fun(mock_data$height, mock_data$weight)

Note: Derived variables are NOT generated by create_mock_data(). See Advanced topics: Derived variables for details.


Example 8: Complete survival analysis dataset with competing risks

Use case: Cohort study with primary event, death, and censoring

# variables.csv
uid,variable,label,variableType,rType,role,position,seed,distribution,rate,shape,followup_min,followup_max,event_prop
v004,interview_date,Interview date,Continuous,date,"enabled,metadata",40,400,uniform,,,,,
v005,primary_event_date,Dementia diagnosis,Continuous,date,"enabled,outcome",50,500,gompertz,0.0001,0.1,0,5475,0.1
v006,death_date,Death date,Continuous,date,"enabled,outcome",60,600,gompertz,0.0001,0.1,365,7300,0.2
v_007,ltfu_date,Loss to follow-up,Continuous,date,"enabled,outcome",70,700,uniform,,,365,7300,0.1
v_008,admin_censor_date,Administrative censoring,Continuous,date,"enabled,metadata",80,800,,,,365,7300,1

# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,proportion
v004,d_007,interview_date,"[2001-01-01,2005-12-31]",copy,1
v005,d_009,primary_event_date,"[2002-01-01,2021-01-01]",copy,0.1
v006,d_011,death_date,"[2002-01-01,2024-12-31]",copy,0.2
v_007,d_013,ltfu_date,"[2002-01-01,2024-12-31]",copy,0.1
v_008,d_015,admin_censor_date,2024-12-31,copy,1

Interpretation:

  • Index date: Interview 2001-2005 (100% have date)
  • Primary event: 10% develop dementia within 0-15 years (gompertz hazard)
  • Competing risk: 20% die within 1-20 years (gompertz hazard)
  • Censoring: 10% lost to follow-up within 1-20 years (uniform)
  • Administrative: All censored at 2024-12-31

Result: Realistic competing risks dataset with:

  • Some experience primary event before death/censoring
  • Some die before primary event
  • Some censored (lost to follow-up or administrative)
  • No individual can have more than one terminal event

See Tutorial: Generating survival data with competing risks for complete workflow.


Example 9: Categorical variable with skip logic (NA::a)

Use case: Pregnancy question with valid skip for males/postmenopausal

# variables.csv
uid,variable,label,variableType,rType,role,position,seed
v010,currently_pregnant,Currently pregnant,Categorical,factor,"enabled,outcome",100,1000

# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v010,d_020,currently_pregnant,1,1,Yes,0.05
v010,d_021,currently_pregnant,2,2,No,0.30
v010,d_022,currently_pregnant,6,NA::a,Valid skip (not applicable),0.60
v010,d_023,currently_pregnant,9,NA::b,Not stated,0.05

Result:

  • 5% currently pregnant
  • 30% not pregnant
  • 60% valid skip (males, postmenopausal females - not eligible for question)
  • 5% eligible but didn’t answer

Statistical interpretation:

  • Denominator for prevalence: 0.05 + 0.30 + 0.05 = 0.40 (eligible respondents)
  • Prevalence among eligible: 0.05 / 0.40 = 12.5%
  • Response rate: (0.05 + 0.30) / 0.40 = 87.5%

Example 10: Versioned metadata with garbage data evolution

Use case: Tracking changes to QA testing parameters over time

# variables.csv (version 1.0.0)
uid,variable,garbage_low_prop,garbage_low_range,mockDataVersion,mockDataVersionNotes
v001,age,0.01,"[-5,10]",1.0.0,Initial version with minimal contamination

# variables.csv (version 1.1.0 - increased QA testing)
uid,variable,garbage_low_prop,garbage_low_range,mockDataVersion,mockDataLastUpdated,mockDataVersionNotes
v001,age,0.05,"[-10,15]",1.1.0,2025-11-15,Increased contamination for comprehensive QA testing

Use case:

  • Track metadata evolution
  • Document why garbage data proportions changed
  • Reproduce exact mock datasets from specific versions
  • Audit trail for scientific publications

Edge cases and special configurations

Single-value categorical variable

Use case: Binary variable with only two categories

uid,uid_detail,variable,recStart,recEnd,proportion
v020,d_040,diabetes,0,0,0.90
v020,d_041,diabetes,1,1,0.10

Result: 90% no diabetes (0), 10% diabetes (1).


Administrative date (all same value)

Use case: Data freeze date - everyone has same value

# variables.csv
uid,variable,variableType,rType,distribution
v_008,admin_censor_date,Continuous,date,

# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,proportion
v_008,d_015,admin_censor_date,2024-12-31,copy,1

Result: All rows have 2024-12-31 (single date, not a range).


Continuous variable with missing data using “else”

Use case: Height with catch-all missing pattern

uid,uid_detail,variable,recStart,recEnd,proportion
v004,d_012,height,"[1.4,2.1]",copy,
v004,d_013,height,else,NA::b,0.02

Result: 98% valid heights [1.4, 2.1], 2% coded as else → NA.


Zero-variance continuous variable (for testing)

Use case: Constant value for all observations

# variables.csv
uid,variable,distribution,mean,sd
v030,constant_value,normal,100,0.001

# variable_details.csv
uid,uid_detail,variable,recStart
v030,d_060,constant_value,"[99.99,100.01]"

Result: All values ≈ 100 (effectively constant given narrow range and tiny SD).


Schema design principles

Core principles

1. Complete grid: Every cell has an explicit value

  • Use NA for not applicable
  • Use "" (empty string) only when value is truly unknown
  • Use sentinel values [,] for empty ranges
  • No implicit defaults - be explicit

2. Separation of concerns:

  • variables.csv: Variable-level metadata (one row per variable)
  • variable_details.csv: Category/range specifications (multiple rows per variable)
  • Generation logic: In R package functions, not in metadata

3. Traceability:

  • UIDs provide permanent identifiers
  • Versioning columns track metadata evolution
  • Foreign keys link files (uid column)

4. Composability:

  • Variables can be generated independently
  • Metadata can be shared across databases (databaseStart filtering)
  • Derived variables calculated post-generation

5. Validation-first:

  • Schema validates before generation
  • Fail fast with clear error messages
  • Sum-to-one enforcement prevents silent errors

Best practices checklist

Before creating metadata:

During metadata creation:

After metadata creation:

For production use:


Migration from older versions

From recodeflow metadata (no MockData extensions)

If you have existing recodeflow metadata without MockData extension columns:

1. Add required extension columns to variables.csv:

# Minimal additions
variables$rType <- "double"  # Or appropriate type
variables$role <- "enabled"
variables$position <- seq(10, by = 10, length.out = nrow(variables))
variables$seed <- variables$position * 10

2. Add distribution columns (if needed):

# For continuous variables
continuous_vars <- variables$variableType == "Continuous"
variables$distribution[continuous_vars] <- "uniform"

# For date variables
date_vars <- variables$variableType == "Date"
variables$distribution[date_vars] <- "uniform"

3. Add recEnd to variable_details.csv:

# For continuous/date variables with ranges
variable_details$recEnd[is.na(variable_details$recEnd)] <- "copy"

# For categorical variables (map codes to themselves)
# Requires manual review to identify missing codes

4. Validate and test:

# Load and validate
mock_data <- create_mock_data(
  databaseStart = "your_database",
  variables = variables,
  variable_details = variable_details,
  n = 100
)

# Verify structure
str(mock_data)
summary(mock_data)

From MockData v0.1.x to v0.2.x

Key changes in v0.2.0:

1. Garbage data column names changed:

# OLD (v0.1.x)
corrupt_low_prop,corrupt_low_range,corrupt_high_prop,corrupt_high_range

# NEW (v0.2.x)
garbage_low_prop,garbage_low_range,garbage_high_prop,garbage_high_range

Migration:

# Rename columns
names(variables)[names(variables) == "corrupt_low_prop"] <- "garbage_low_prop"
names(variables)[names(variables) == "corrupt_low_range"] <- "garbage_low_range"
names(variables)[names(variables) == "corrupt_high_prop"] <- "garbage_high_prop"
names(variables)[names(variables) == "corrupt_high_range"] <- "garbage_high_range"

2. Distribution parameters moved to variables.csv:

In v0.1.x, some parameters were in variable_details.csv. In v0.2.x, all distribution parameters are in variables.csv.

3. New versioning columns available:

mockDataVersion,mockDataLastUpdated,mockDataVersionNotes

Add these to track metadata changes over time.


Performance optimization

For large datasets (n > 1M)

1. Use simpler distributions:

# Slower
distribution,mean,sd
normal,50,15

# Faster
distribution
uniform

2. Minimize variables:

Only include variables you actually need. Each variable adds generation time.

3. Batch generation:

# Generate in chunks
batch_size <- 100000
n_batches <- ceiling(total_n / batch_size)

results <- lapply(1:n_batches, function(i) {
  create_mock_data(
    databaseStart = "my_study",
    variables = variables,
    variable_details = variable_details,
    n = min(batch_size, total_n - (i-1)*batch_size),
    seed = base_seed + i
  )
})

final_data <- bind_rows(results)

4. Simplify metadata:

  • Reduce number of categories per variable
  • Use uniform instead of normal distributions
  • Minimize garbage data (only for QA testing, not production)

Memory management

For n > 10M rows:

# Generate in batches and write to disk
for (i in 1:n_batches) {
  batch <- create_mock_data(
    databaseStart = "my_study",
    variables = variables,
    variable_details = variable_details,
    n = batch_size,
    seed = base_seed + i
  )

  # Write to disk immediately
  write.csv(batch, paste0("mock_data_batch_", i, ".csv"), row.names = FALSE)

  # Free memory
  rm(batch)
  gc()
}

# Combine later if needed
all_files <- list.files(pattern = "mock_data_batch_.*\\.csv")
combined <- do.call(rbind, lapply(all_files, read.csv))

Reference implementation

See inst/extdata/minimal-example/ for a complete working example with 11 variables (1 categorical, 10 continuous) and 26 detail rows demonstrating:

  • All 20 variable-level extension columns (plus 4 core recodeflow columns)
  • All 3 detail-level extension columns (plus 4 core recodeflow columns)
  • All variable types (integer, factor, double, date)
  • All garbage data patterns (advanced low/high, simple prop_garbage)
  • All distribution types (normal, uniform, gompertz)
  • Survival analysis with competing risks
  • UID-based foreign keys
  • Interval notation throughout
  • Complete grid principle (no empty strings, explicit NA/sentinels)

This example validates successfully and can be used as a template for new projects.