About this vignette: All numeric values shown in this vignette are computed from the actual DemPoRT metadata files. Code is hidden by default for readability, but you can view the source .qmd file to see how values are calculated.
Overview
This vignette demonstrates generating mock data for DemPoRT (Dementia Population Risk Tool), a survival analysis model predicting dementia risk. DemPoRT uses harmonized data from multiple Canadian Community Health Survey (CCHS) cycles spanning 2001-2018. Dementia incidence (onset) is individually linked using administrative data housed at ICES – along with mortality records and other censoring information, where it is combined with CCHS survey data.
The MockData workflow for multi-cycle projects:
- Generate separate mock datasets for individual cycles/databases (e.g., CCHS 2001, CCHS 2003)
- Hand off to analyst to test harmonization pipeline (using recodeflow/cchsflow)
- Harmonize and bind the individual datasets using your analysis code
- Calculate derived variables after harmonization
MockData generates raw, unharmonized data that mimics individual source databases. It does NOT pool or bind data—that is the analyst’s responsibility to test.
What you’ll learn:
- Generate mock data for individual survey cycles
- Understand multi-cycle metadata structure
- Test harmonization workflows with mock data
- Validate metadata consistency
- Generate survival analysis variables (dates, events)
Prerequisites:
- Getting started - Basic mock data generation
- Configuration files - Batch generation approach
- Date variables - Temporal data and survival analysis
Setup
Load required packages:
Required inputs
Typically, you need three configuration files to generate mock data:
-
Mock data config file (
mock_data_config.csv): Specifies which variables to generate and their roles -
Variables file (
variables.csv): Defines variable metadata (labels, types, cycles) -
Variable details file (
variable_details.csv): Specifies distributions and categories
Note: The mock data config file is not yet complete for DemPoRT. This example demonstrates generating mock data directly from harmonization metadata (variables and variable_details files). We will update this vignette once the mock data config is available.
For DemPoRT, the harmonization metadata describes 20+ individual CCHS cycles (2001-2018):
# Load DemPoRT metadata
variables <- read.csv(
system.file("extdata/demport/variables_DemPoRT.csv", package = "MockData"),
header = TRUE,
check.names = FALSE,
na.strings = c("", "NA", "N/A"),
stringsAsFactors = FALSE
)
variable_details <- read.csv(
system.file("extdata/demport/variable_details_DemPoRT.csv", package = "MockData"),
header = TRUE,
check.names = FALSE,
na.strings = c("", "NA", "N/A"),
stringsAsFactors = FALSE
)The DemPoRT metadata includes 74 variables with 669 detail rows. First 5 variables: ADL_01, ADL_02, ADL_03, ADL_04, ADL_05.
Variable types in the metadata:
- Categorical: Sex, education, marital status, immigration status
- Continuous: Age, BMI, alcohol consumption
- Date: Birth date, interview date (baseline), death date (outcome)
- Derived: Age groups, BMI categories (calculated from raw variables)
Important: The metadata describes harmonized target variables, not raw source data. MockData uses this metadata to generate mock data that mimics raw source databases before harmonization.
Multi-cycle structure: Each variable’s metadata includes a databaseStart column listing which cycles contain that variable (e.g., “cchs2001_p, cchs2003_p, cchs2005_p”).
Generate mock data
The simplest way to generate DemPoRT mock data is using create_mock_data(). This function reads the metadata files and generates all variables at once.
# Generate mock data for all DemPoRT raw variables
# Derived variables are automatically excluded
mock_demport <- create_mock_data(
config_path = system.file("extdata/demport/variables_DemPoRT.csv", package = "MockData"),
details_path = system.file("extdata/demport/variable_details_DemPoRT.csv", package = "MockData"),
n = 100,
seed = 123
)
# View structure
dim(mock_demport)
# [1] 100 69
# First few columns
names(mock_demport)[1:5]
# [1] "ADL_01" "ADL_02" "ADL_03" "ADL_04" "ADL_05"
# Show sample of first 3 rows, first 5 columns
head(mock_demport[, 1:5], 3)
# ADL_01 ADL_02 ADL_03 ADL_04 ADL_05
# 1 No difficulty No difficulty No help No help No help
# 2 Some difficulty No difficulty No help No help Some help
# 3 No difficulty No difficulty No help No help No helpImportant: Derived variables are NOT generated
MockData automatically excludes derived variables during generation. Derived variables (e.g., ADL_der, HWTGBMI_der, pack_years_der) are calculated from raw variables and should NOT be generated as mock data.
How it works:
- Variables with
role = "derived,enabled"are automatically filtered out - Only raw variables (e.g.,
ADL_01,HWTGHTM,HWTGWTK) are generated. This includes raw variables required to calculate derived variables. - You calculate derived variables using your analysis pipeline after generation.
Why this matters:
- Consistency: Derived variables use the same calculation logic as your real analysis
- Testing: Test your derived variable calculations on mock data
- Correctness: Prevents generating fake derived values that don’t match raw inputs
Understanding multi-cycle data
DemPoRT harmonizes and pools data from multiple CCHS cycles. The harmonization metadata specifies which cycles contain each variable via the databaseStart column.
Key concept: MockData generates individual cycle datasets (e.g., mock CCHS 2001, mock CCHS 2003). The analyst then:
- Harmonizes each cycle separately using recodeflow/cchsflow
- Binds the harmonized cycles together
- Calculates derived variables on the pooled dataset
Extract cycle information
DemPoRT harmonizes 26 CCHS cycles ranging from 2001 to 2017-2018. The cycles are: cchs2001_i, cchs2001_p, cchs20013_2014_i, cchs2003_i, cchs2003_p, cchs2005_i, cchs2005_p, cchs2007_2008_i, cchs2007_2008_p, cchs2009_2010_i, cchs2009_2010_p, cchs2009_s, cchs2010_p, cchs2010_s, cchs2011_2012_i, cchs2011_2012_p, cchs2012_p, cchs2012_s, cchs2013_2014_i, cchs2013_2014_p, cchs2014_p, cchs2015_2016_i, cchs2015_2016_p, cchs2017_2018, cchs2017_2018_i, cchs2017_2018_p.
Interpretation:
- Each variable may appear in different subsets of cycles
- Example:
EDU_04(education) appears in cycles 2001-2014, but not 2015-2018
Check variable coverage by cycle
The EDU_04 (education) variable is available in 0 cycles. First 3: .
Why this matters for MockData:
- Generate separate mock datasets for each cycle you want to test
- Each cycle will have different variables based on
databaseStartmetadata - Test your harmonization pipeline on individual cycles before pooling
- Missing data patterns may differ by cycle
Validate metadata consistency
Before generating mock data, it’s good practice to check that your config and details files are aligned.
The metadata includes 74 variables in config and 80 variables in details. Warning: 6 variables missing details. Warning: 12 extra variables in details.
Common issues this catches:
- Typos in variable names
- Variables defined in config but missing implementation details
- Orphaned details rows from deleted variables
- Helpful for troubleshooting
create_mock_data()errors
Classify variables by type
Understanding variable types helps organize analysis and identify which variables need special handling.
Variable type summary:
- Derived: 7
- Categorical: 47
- Continuous: 20
- Total: 74
Classification accounts for all variables.
About derived variables:
- Examples:
age_cat(age groups),bmi_cat(BMI categories) - These are calculated from raw variables (age, BMI) after harmonization
- Should not be generated as raw mock data
- In real workflows, you’d calculate them using recodeflow after generating raw variables
Examine data quality for individual cycles
After generating mock data for an individual cycle, check variable completeness and data types.
Remember: You’re examining one cycle’s raw data here, not pooled data. Each cycle may have different variables and different data patterns.
# Example: Check mock data for CCHS 2001
# (Replace mock_demport with your cycle-specific dataset, e.g., mock_2001)
# Check for date variables
date_vars <- names(mock_demport)[sapply(mock_demport, inherits, "Date")]
cat("Date variables in this cycle:\n")
if (length(date_vars) > 0) {
for (var in date_vars) {
date_range <- range(mock_demport[[var]], na.rm = TRUE)
n_missing <- sum(is.na(mock_demport[[var]]))
cat(sprintf(" %s: %s to %s (%d missing)\n",
var,
as.character(date_range[1]),
as.character(date_range[2]),
n_missing))
}
} else {
cat(" (No date variables in this cycle)\n")
}
# Check for factor variables
factor_vars <- names(mock_demport)[sapply(mock_demport, is.factor)]
cat("\nFactor variables:", length(factor_vars), "\n")
if (length(factor_vars) > 0) {
cat(" First 3:", paste(head(factor_vars, 3), collapse = ", "), "\n")
# Show levels for first factor
cat(sprintf(" %s levels: %s\n",
factor_vars[1],
paste(levels(mock_demport[[factor_vars[1]]]), collapse = ", ")))
}
# Example output:
# Factor variables: 8
# First 3: DHH_SEX, EDU_04, SMK_202
# DHH_SEX levels: Male, Female
# Check for numeric variables
numeric_vars <- names(mock_demport)[sapply(mock_demport, is.numeric)]
cat("\nNumeric variables:", length(numeric_vars), "\n")
if (length(numeric_vars) > 0) {
cat(" First 3:", paste(head(numeric_vars, 3), collapse = ", "), "\n")
}
# Example output:
# Numeric variables: 12
# First 3: ADL_01, ADL_02, HWTDHTMPractical workflow: Generate individual cycle datasets
MockData generates separate datasets for individual cycles. You do NOT bind them together in MockData - that’s the analyst’s job to test harmonization.
Recommended workflow:
Step 1: Filter metadata for each cycle
# Example: Create cycle-specific metadata files
# Filter variables and details by databaseStart column
# CCHS 2001
vars_2001 <- variables %>%
filter(str_detect(databaseStart, "cchs2001"))
details_2001 <- variable_details %>%
filter(variable %in% vars_2001$variable)
# Save cycle-specific metadata
write.csv(vars_2001, "demport_variables_2001.csv", row.names = FALSE)
write.csv(details_2001, "demport_details_2001.csv", row.names = FALSE)
# Repeat for other cycles...Step 2: Generate mock data for each cycle
# Generate separate mock dataset for CCHS 2001
mock_2001 <- create_mock_data(
config_path = "demport_variables_2001.csv",
details_path = "demport_details_2001.csv",
n = 1000,
seed = 2001 # Cycle-specific seed for reproducibility
)
# Generate separate mock dataset for CCHS 2003
mock_2003 <- create_mock_data(
config_path = "demport_variables_2003.csv",
details_path = "demport_details_2003.csv",
n = 1000,
seed = 2003
)
# MockData STOPS here - hand off to analystStep 3: Test harmonization (analyst’s job)
After generating individual cycle datasets, you’ll harmonize them using your harmonization pipeline (e.g., recodeflow/cchsflow), bind the harmonized cycles together, and calculate derived variables. Good luck!
Key principles:
- MockData generates: Individual raw cycle datasets (separate files/objects)
- Analyst harmonizes: Each cycle using recodeflow/cchsflow
- Analyst binds: Harmonized cycles together for analysis
- Analyst calculates: Derived variables on pooled data
Why separate cycles matter:
- Tests harmonization logic on realistic raw data structure
- Allows cycle-specific sample sizes (CCHS 2001 = 130K, CCHS 2017 = 110K)
- Enables testing cycle-specific missing patterns
- Matches real workflow: harmonize first, then pool
Survival analysis dates
DemPoRT is a survival model predicting time to dementia diagnosis. This section demonstrates how to generate realistic survival dates with proper temporal ordering and competing risks.
Date variables for survival analysis
The table below shows the date variables in temporal order. The survival analysis focuses on the three key dates: interview (cohort entry), dementia onset (primary outcome), and death (competing risk). Birth date is included in the metadata for age calculation but is not required for the survival analysis itself.
| Variable | Role in analysis | Date range | Temporal constraint |
|---|---|---|---|
birth_date |
Age calculation (not used in survival analysis) | 1920-1987 | birth < interview |
interview_date |
Cohort entry (t=0) | 2001-2005 | Baseline for all follow-up |
dementia_onset_date |
Primary outcome | 2001-2017 | interview < dementia < death |
death_date |
Competing risk | 2001-2017 | interview < death |
Administrative censoring: End of follow-up is 2017-03-31. Individuals alive at this date are censored.
Step 1: Generate survival dates with temporal ordering
The create_survival_dates() function generates paired entry and event dates with guaranteed temporal ordering. This must be called separately from create_mock_data() because it processes TWO variables together.
# Extract metadata for interview (cohort entry) and death (competing risk)
interview_details <- variable_details %>%
filter(variable == "interview_date")
death_details <- variable_details %>%
filter(variable == "death_date")
interview_row <- variables %>%
filter(variable == "interview_date")
death_row <- variables %>%
filter(variable == "death_date")
# Generate 1000 individuals with guaranteed temporal ordering
survival_dates <- create_survival_dates(
entry_var_row = interview_row,
entry_details_subset = interview_details,
event_var_row = death_row,
event_details_subset = death_details,
n = 1000,
seed = 456,
df_mock = data.frame()
)Verify temporal ordering and date ranges:
| Variable | Min | Max | Expected | Temporal ordering |
|---|---|---|---|---|
| interview_date | 2001-01-09 | 2005-12-31 | 2001-2005 | Baseline (t=0) |
| death_date | 2002-07-07 | 2015-10-26 | 2001-2017 | ✓ All >= interview (n=1000) |
Step 2: Add competing risks (dementia onset)
Real survival studies track multiple outcomes. In DemPoRT, two mutually exclusive events can occur:
- Dementia diagnosis (primary outcome)
- Death without dementia (competing risk)
We’ll simulate dementia occurring between interview and death for 10% of the cohort:
# Simulate dementia incidence (10% develop dementia)
set.seed(789)
survival_dates$dementia_occurred <- rbinom(
n = nrow(survival_dates),
size = 1,
prob = 0.10
)
# For those with dementia, set dementia_onset_date between interview and death
survival_dates$dementia_onset_date <- ifelse(
survival_dates$dementia_occurred == 1,
survival_dates$interview_date +
runif(nrow(survival_dates)) *
(survival_dates$death_date - survival_dates$interview_date),
NA
)
# Convert back to Date class
survival_dates$dementia_onset_date <- as.Date(
survival_dates$dementia_onset_date,
origin = "1970-01-01"
)Step 3: Add administrative censoring
Not everyone is followed to death in real cohort studies. We apply administrative censoring at 2017-03-31 (end of follow-up):
# Apply censoring: individuals alive after 2017-03-31 are censored
censor_date <- as.Date("2017-03-31")
survival_dates$censored <- survival_dates$death_date > censor_date
# Set observed death date (censored individuals get censor_date)
survival_dates$death_date_observed <- ifelse(
survival_dates$censored,
censor_date,
survival_dates$death_date
)
survival_dates$death_date_observed <- as.Date(
survival_dates$death_date_observed,
origin = "1970-01-01"
)Step 4: Calculate time-to-event and final outcomes
# Calculate follow-up time to first event (years)
survival_dates$time_to_event <- pmin(
as.numeric(difftime(survival_dates$dementia_onset_date,
survival_dates$interview_date,
units = "days")) / 365.25,
as.numeric(difftime(survival_dates$death_date_observed,
survival_dates$interview_date,
units = "days")) / 365.25,
na.rm = TRUE
)
# Event type: 0=censored, 1=dementia, 2=death without dementia
survival_dates$event_type <- ifelse(
!is.na(survival_dates$dementia_onset_date), 1,
ifelse(!survival_dates$censored, 2, 0)
)Summary: Cohort characteristics
| Outcome | Event code | N | Percent | Mean follow-up (years) | Temporal ordering verified |
|---|---|---|---|---|---|
| Censored | 0 | 0 | 0.0% | — | ✓ All interview < censor (n=0) |
| Dementia | 1 | 102 | 10.2% | 3.0 | ✓ All interview < dementia < death (n=102) |
| Death | 2 | 898 | 89.8% | 5.6 | ✓ All interview < death (n=898) |
The simulated cohort demonstrates realistic survival analysis patterns:
- Temporal ordering: All dates satisfy the constraint interview < event < death
- Competing risks: Individuals experience either dementia or death, not both
- Censoring: Individuals alive at study end (2017-03-31) are censored
- Follow-up time: Varies by outcome, with censored individuals having longest follow-up
Prepare for Cox regression or competing risks analysis
The dataset is now ready for survival analysis:
# Example: Cox proportional hazards for dementia
# library(survival)
#
# model_dementia <- coxph(
# Surv(time_to_event, event_type == 1) ~ age + sex + education,
# data = survival_dates
# )
#
# # Example: Competing risks analysis
# library(cmprsk)
#
# cr_fit <- cuminc(
# ftime = survival_dates$time_to_event,
# fstatus = survival_dates$event_type,
# cencode = 0
# )Key takeaways:
-
create_survival_dates()generates entry and event dates with guaranteed temporal ordering - Competing risks (dementia vs death) require additional simulation between entry and event
- Administrative censoring is applied based on study design (end date)
- All temporal constraints are verified before analysis
- The dataset is ready for Cox regression or competing risks models
See also: Date variables tutorial for more on date generation and temporal constraints.
Visualize mock data (optional)
After generating mock data, you can visualize distributions to verify data quality. For example, create histograms for continuous variables (height, weight) or bar charts for categorical variables (sex, education).
# Example: Check a continuous variable distribution
hist(mock_demport$HWTDHTM, # Height in cm
breaks = 20,
main = "Height Distribution",
xlab = "Height (cm)",
col = "lightblue")
# Example: Check a categorical variable distribution
barplot(table(mock_demport$DHH_SEX),
main = "Sex Distribution",
col = "lightgreen")These visualizations help you verify that generated values fall within expected ranges and categories match your metadata specifications.
Export mock data
Save your generated mock data for use in testing your analysis pipeline. You can export to CSV for broad compatibility or RDS to preserve R-specific data types (Date objects, factor levels).
After running these commands, you’ll have two files saved to your working directory: demport_mock_data.csv and demport_mock_data.rds.
What you learned
This vignette demonstrated:
-
Batch generation with
create_mock_data()for complex multi-variable datasets - Multi-cycle data concepts and why they matter for survey harmonization
- Metadata validation to catch inconsistencies before generation
- Variable classification to distinguish raw vs. derived variables
- Data quality checks to verify generated data meets expectations
- Survival analysis setup including date variables and event indicators
Next steps
Core concepts:
- Getting started - If you haven’t read this yet
- Configuration files - Deep dive on metadata format
- Date variables - Advanced temporal data and survival workflows
Real-world examples:
- CCHS example - Canadian Community Health Survey
- CHMS example - Canadian Health Measures Survey
Advanced topics:
- Configuration reference - Complete metadata specification
- Advanced topics - Performance optimization and integration
For DemPoRT users:
- Apply mock data in analysis pipelines
- Test harmonization code before accessing real data
- Validate survival model implementation
- Generate sample datasets for method development