About this vignette: All numeric values shown in this vignette are computed from the actual CHMS sample metadata files. Code is hidden by default for readability, but you can view the source .qmd file to see how values are calculated.
Overview
The Canadian Health Measures Survey (CHMS) data exists only in secure data environments. This example demonstrates how to generate mock CHMS data using chmsflow metadata for testing harmonization workflows before accessing real data.
Load metadata
Load the CHMS harmonization metadata files that define variables and their details across cycles.
# CHMS sample variables
variables <- read.csv(
system.file("extdata/chms/variables_chmsflow_sample.csv", package = "MockData"),
header = TRUE,
check.names = FALSE,
na.strings = c("", "NA", "N/A"),
stringsAsFactors = FALSE
)
# CHMS sample variable details
variable_details <- read.csv(
system.file("extdata/chms/variable_details_chmsflow_sample.csv", package = "MockData"),
header = TRUE,
check.names = FALSE,
na.strings = c("", "NA", "N/A"),
stringsAsFactors = FALSE
)The CHMS sample metadata includes 18 harmonized variables with 73 detail rows.
Extract available cycles
The sample metadata includes 8 CHMS cycles: cycle2, cycle2_meds, cycle3, cycle4, cycle5, cycle6, cycle1, cycle1_meds.
Understanding the metadata
The sample metadata includes 20 harmonized variables. For cycle2, there are 15 harmonized variables available:
| variable | variable_raw | variableType | label |
|---|---|---|---|
| alc_11 | alc_11 | Categorical | Drank in past year |
| alc_17 | alc_17 | Categorical | Ever drank alcohol |
| alc_18 | alc_18 | Categorical | Drank alcohol regularly |
| alcdwky | alcdwky | Continuous | Drinks in week |
| ammdmva1 | ammdmva1 | Continuous | Minutes of exercise per day (accelerometer Day 1) |
| ammdmva2 | ammdmva2 | Continuous | Minutes of exercise per day (accelerometer Day 2) |
| ammdmva3 | ammdmva3 | Continuous | Minutes of exercise per day (accelerometer Day 3) |
| ammdmva4 | ammdmva4 | Continuous | Minutes of exercise per day (accelerometer Day 4) |
| ammdmva5 | ammdmva5 | Continuous | Minutes of exercise per day (accelerometer Day 5) |
| ammdmva6 | ammdmva6 | Continuous | Minutes of exercise per day (accelerometer Day 6) |
Get raw variables to generate
For mock data generation, we need to create the raw source variables (before harmonization), not the harmonized variables.
# Get unique raw variables needed for this cycle (excludes derived variables)
raw_vars <- get_raw_variables(example_cycle, variables, variable_details,
include_derived = FALSE)This cycle requires 15 unique raw variables: 4 categorical and 11 continuous.
| variable_raw | variableType | harmonized_vars | n_harmonized |
|---|---|---|---|
| alc_11 | Categorical | alc_11 | 1 |
| alc_17 | Categorical | alc_17 | 1 |
| alc_18 | Categorical | alc_18 | 1 |
| alcdwky | Continuous | alcdwky | 1 |
| ammdmva1 | Continuous | ammdmva1 | 1 |
| ammdmva2 | Continuous | ammdmva2 | 1 |
| ammdmva3 | Continuous | ammdmva3 | 1 |
| ammdmva4 | Continuous | ammdmva4 | 1 |
| ammdmva5 | Continuous | ammdmva5 | 1 |
| ammdmva6 | Continuous | ammdmva6 | 1 |
| ammdmva7 | Continuous | ammdmva7 | 1 |
| bpmdpbpd | Continuous | bpmdpbpd | 1 |
| bpmdpbps | Continuous | bpmdpbps | 1 |
| clc_age | Continuous | clc_age | 1 |
| clc_sex | Categorical | clc_sex | 1 |
Generate mock data for one cycle
Now let’s generate mock data for a single cycle.
# Configuration
n_records <- 100
target_cycle <- example_cycle
seed <- 12345
# Initialize data frame
df_mock <- data.frame(id = 1:n_records)We’ll generate 100 mock records for cycle2.
Generate categorical variables
Generated 4 categorical variables. Data frame now has 5 columns.
Generate continuous variables
Generated 11 continuous variables. Final data frame has 16 columns.
Examine the result
Mock data structure:
'data.frame': 100 obs. of 16 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ alc_11 : chr "2" "2" "2" "1" ...
$ alc_17 : chr "1" "1" "2" "1" ...
$ alc_18 : chr "2" "2" "2" "2" ...
$ clc_sex : chr "2" "2" "2" "1" ...
$ alcdwky : num 55.47 8.82 76.74 20.65 78.03 ...
$ ammdmva1: num 127.1 99.2 222.5 275.5 185.2 ...
$ ammdmva2: num 105 337 18 131 337 ...
$ ammdmva3: num 114.8 124.5 285.4 66.4 28.5 ...
$ ammdmva4: num 263 140 52 286 162 ...
$ ammdmva5: num 347.1 120 66.1 94.7 53.7 ...
$ ammdmva6: num 284.4 167.3 132.3 342.3 94.5 ...
$ ammdmva7: num 372.8 162.3 81.9 266 182.1 ...
$ bpmdpbpd: num 95.4 83.3 54.6 94.7 71.1 ...
$ bpmdpbps: num 110.8 143.9 149.6 89.7 138.8 ...
$ clc_age : num 74.9 45.8 61.5 53.2 45.8 ...
First 5 rows:
| id | alc_11 | alc_17 | alc_18 | clc_sex | alcdwky | ammdmva1 | ammdmva2 | ammdmva3 | ammdmva4 | ammdmva5 | ammdmva6 | ammdmva7 | bpmdpbpd | bpmdpbps | clc_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 2 | 2 | 55.468431 | 127.0529 | 104.96991 | 114.80790 | 262.72802 | 347.13607 | 284.3697 | 372.78829 | 95.41136 | 110.78141 | 74.85507 |
| 2 | 2 | 1 | 2 | 2 | 8.817749 | 99.2286 | 337.08440 | 124.51908 | 139.91543 | 119.98442 | 167.2858 | 162.25131 | 83.33644 | 143.93198 | 45.84278 |
| 3 | 2 | 2 | 2 | 2 | 76.741093 | 222.5007 | 18.04991 | 285.35616 | 51.97035 | 66.12100 | 132.3284 | 81.87789 | 54.62073 | 149.60858 | 61.48123 |
| 4 | 1 | 1 | 2 | 1 | 20.652405 | 275.4628 | 130.82593 | 66.40645 | 285.97570 | 94.70487 | 342.3053 | 265.99578 | 94.74711 | 89.69766 | 53.21518 |
| 5 | 2 | 2 | 2 | 1 | 78.027656 | 185.1855 | 337.06596 | 28.49568 | 161.50520 | 53.68667 | 94.5351 | 182.09282 | 71.11196 | 138.83499 | 45.78124 |
Missing values: No missing values
Summary
This example demonstrated generating mock CHMS data for testing chmsflow harmonization workflows. The generated data:
- Respects category ranges from variable_details
- Includes appropriate missing values
- Uses reproducible seeds
- Can be used to test harmonization functions before accessing real CHMS data
Next steps
- Test your chmsflow harmonization pipeline on this mock data
- Generate mock data for additional cycles as needed
- Calculate derived variables after harmonization (not during mock data generation)