
About this vignette: All numeric values shown in this vignette are computed from the actual CCHS sample metadata files. Code is hidden by default for readability, but you can view the source .qmd file to see how values are calculated.

Overview

This example demonstrates generating mock Canadian Community Health Survey (CCHS) data using cchsflow metadata. The generated mock data can be used to test harmonization workflows before accessing real CCHS data.

Load metadata

Load the CCHS harmonization metadata files that define variables and their details across cycles.

# CCHS sample variables
variables <- read.csv(
  system.file("extdata/cchs/variables_cchsflow_sample.csv", package = "MockData"),
  header = TRUE,
  check.names = FALSE,
  na.strings = c("", "NA", "N/A"),
  stringsAsFactors = FALSE
)

# CCHS sample variable details
variable_details <- read.csv(
  system.file("extdata/cchs/variable_details_cchsflow_sample.csv", package = "MockData"),
  header = TRUE,
  check.names = FALSE,
  na.strings = c("", "NA", "N/A"),
  stringsAsFactors = FALSE
)

The CCHS sample metadata includes 20 harmonized variables with 126 detail rows.
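Counts like these can be derived with base R once the two files are loaded. The sketch below uses small toy stand-ins rather than the real files, and assumes that the `variable` column identifies each harmonized variable and that `recStart` holds the source category or range. The toy rows are made up for illustration.

```r
# Toy stand-ins for the two metadata tables (illustrative rows, not the real files)
variables_toy <- data.frame(
  variable     = c("ADL_01", "ADL_02", "ADM_RNO"),
  variableType = c("Categorical", "Categorical", "Continuous"),
  stringsAsFactors = FALSE
)
variable_details_toy <- data.frame(
  variable = c("ADL_01", "ADL_01", "ADL_02", "ADL_02", "ADM_RNO"),
  recStart = c("1", "2", "1", "2", "[1,999998]"),
  stringsAsFactors = FALSE
)

# One entry per harmonized variable; one detail row per category or range
n_variables   <- length(unique(variables_toy$variable))
n_detail_rows <- nrow(variable_details_toy)

# Sanity check: every detail row should refer to a known variable
orphans <- setdiff(variable_details_toy$variable, variables_toy$variable)

cat("Variables:", n_variables, "- detail rows:", n_detail_rows, "\n")
```

Running the same three expressions on `variables` and `variable_details` should give the figures quoted above, provided the real files use these column names.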

Extract available cycles

The sample metadata covers 7 distinct CCHS cycles: cchs2001_p, cchs2003_p, cchs2005_p, cchs2007_2008_p, cchs2009_2010_p, cchs2011_2012_p, cchs2013_2014_p.
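The extraction code is hidden in the rendered page, so the following is only a sketch of one way to list cycles. It assumes the metadata stores cycle names as comma-separated strings in a column such as `databaseStart`; the toy vector below stands in for that column.

```r
# Toy databaseStart values: comma-separated cycle names per row (assumed layout)
database_start <- c(
  "cchs2001_p, cchs2003_p",
  "cchs2003_p,cchs2005_p",
  NA
)

# Split on commas, trim whitespace, drop empty entries, then de-duplicate
cycles <- unlist(strsplit(database_start[!is.na(database_start)], ","))
cycles <- trimws(cycles)
cycles <- sort(unique(cycles[nzchar(cycles)]))

print(cycles)
```

Trimming before calling `unique()` matters: without it, "cchs2003_p" and " cchs2003_p" would be counted as two different cycles.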

Understanding the metadata

All 20 harmonized variables in the sample metadata are available for cchs2003_p. The first 10 are shown below:

variable  variable_raw  variableType  label
ADL_01    RACC_6A       Categorical   Help preparing meals
ADL_02    RACC_6B1      Categorical   Help appointments/errands
ADL_03    RACC_6C       Categorical   Help housework
ADL_04    RACC_6E       Categorical   Help personal care
ADL_05    RACC_6F       Categorical   Help move inside house
ADL_06    RACC_6G       Categorical   Help personal finances
ADL_07    RACC_6D       Categorical   Help heavy household chores
ADLF6R    RACCF6R       Categorical   Help tasks
ADL_der   NA            Categorical   Help tasks
ADM_RNO   ADMC_RNO      Continuous    Sequential record number

Get raw variables to generate

For mock data generation, we need to create the raw source variables (before harmonization), not the harmonized variables.

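# example_cycle is defined in the vignette's hidden code; the surrounding
# text uses cchs2003_p, so that value is assumed here
example_cycle <- "cchs2003_p"

# Get unique raw variables needed for this cycle (excludes derived variables)
raw_vars <- get_raw_variables(example_cycle, variables, variable_details,
                               include_derived = FALSE)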

This cycle requires 19 unique raw variables: 11 categorical and 8 continuous.

variable_raw  variableType  harmonized_vars  n_harmonized
ADMC_RNO      Continuous    ADM_RNO          1
ALCC_1        Categorical   ALC_1            1
ALCC_5        Categorical   ALW_1            1
ALCC_5A1      Continuous    ALW_2A1          1
ALCC_5A2      Continuous    ALW_2A2          1
ALCC_5A3      Continuous    ALW_2A3          1
ALCC_5A4      Continuous    ALW_2A4          1
ALCC_5A5      Continuous    ALW_2A5          1
ALCC_5A6      Continuous    ALW_2A6          1
ALCC_5A7      Continuous    ALW_2A7          1
ALCCDTYP      Categorical   ALCDTTM          1
RACC_6A       Categorical   ADL_01           1
RACC_6B1      Categorical   ADL_02           1
RACC_6C       Categorical   ADL_03           1
RACC_6D       Categorical   ADL_07           1
RACC_6E       Categorical   ADL_04           1
RACC_6F       Categorical   ADL_05           1
RACC_6G       Categorical   ADL_06           1
RACCF6R       Categorical   ADLF6R           1
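To make the harmonized-to-raw mapping concrete, here is a minimal sketch of how a raw name could be looked up for one cycle. It is not the implementation of `get_raw_variables()`. It assumes a `variableStart` column holding entries of the form `cycle::raw_name`, and the helper `extract_raw_name()` and the toy rows are hypothetical.

```r
# Hypothetical helper: pull the raw variable name for one cycle out of a
# variableStart string such as "cchs2001_p::RACA_6A, cchs2003_p::RACC_6A".
# Returns NA when the cycle is not mentioned (for example, derived variables).
extract_raw_name <- function(variable_start, cycle) {
  parts  <- trimws(strsplit(variable_start, ",")[[1]])
  prefix <- paste0(cycle, "::")
  hit    <- parts[startsWith(parts, prefix)]
  if (length(hit) == 0) return(NA_character_)
  substring(hit[1], nchar(prefix) + 1)
}

# Toy metadata rows (illustrative values only)
toy <- data.frame(
  variable      = c("ADL_01", "ADM_RNO", "ADL_der"),
  variableType  = c("Categorical", "Continuous", "Categorical"),
  variableStart = c(
    "cchs2001_p::RACA_6A, cchs2003_p::RACC_6A",
    "cchs2001_p::ADMA_RNO, cchs2003_p::ADMC_RNO",
    "DerivedVar::[ADL_01, ADL_02]"
  ),
  stringsAsFactors = FALSE
)

toy$variable_raw <- vapply(toy$variableStart, extract_raw_name,
                           character(1), cycle = "cchs2003_p",
                           USE.NAMES = FALSE)

# Keep only rows with a raw source variable, mirroring include_derived = FALSE
raw_toy <- toy[!is.na(toy$variable_raw),
               c("variable_raw", "variableType", "variable")]
print(raw_toy)
```

The derived row drops out because it has no raw source in the chosen cycle, which matches the `NA` shown for ADL_der in the earlier table.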

Generate mock data for one cycle

Now let’s generate mock data for a single cycle.

# Configuration
n_records <- 100
target_cycle <- example_cycle
seed <- 12345

# Initialize data frame
df_mock <- data.frame(id = 1:n_records)

We’ll generate 100 mock records for cchs2003_p.

Generate categorical variables

Generated 11 categorical variables. Data frame now has 12 columns.
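The generation code for this step is hidden in the rendered page. As a rough illustration only, categorical columns can be filled by sampling from each variable's allowed codes with base R. The category lists below are assumptions for the sketch, not values read from variable_details.

```r
set.seed(12345)  # same seed value as in the configuration above
n_records <- 100

# Illustrative category codes per raw variable (assumed for this sketch)
categories <- list(
  RACC_6A  = c("1", "2"),
  ALCCDTYP = c("1", "2", "3")
)

df_toy <- data.frame(id = seq_len(n_records))
for (var_name in names(categories)) {
  # Draw one code per record, uniformly and with replacement
  df_toy[[var_name]] <- sample(categories[[var_name]], n_records, replace = TRUE)
}

str(df_toy)
```

Codes are kept as character strings here, matching the `chr` columns in the structure output further down.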

Generate continuous variables

Generated 8 continuous variables. Final data frame has 20 columns.
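Again, the actual generation code is hidden. One simple way to produce continuous columns is to draw uniformly within each variable's valid range, as in the sketch below. The ranges are assumed for illustration, and the package may use a different distribution.

```r
set.seed(12345)
n_records <- 100

# Illustrative valid ranges as c(min, max) (assumed for this sketch)
ranges <- list(
  ADMC_RNO = c(1, 999998),
  ALCC_5A1 = c(0, 50)
)

df_toy <- data.frame(id = seq_len(n_records))
for (var_name in names(ranges)) {
  bounds <- ranges[[var_name]]
  # Uniform draw between the lower and upper bound
  df_toy[[var_name]] <- runif(n_records, min = bounds[1], max = bounds[2])
}

summary(df_toy[names(ranges)])
```

A uniform draw returns non-integer values, which may be fine for testing range checks but would need rounding for variables that should be whole numbers.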

Examine the result

Mock data structure:

'data.frame':   100 obs. of  20 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ ALCC_1  : chr  "2" "2" "2" "1" ...
 $ ALCC_5  : chr  "31" "27" "18" "35" ...
 $ ALCCDTYP: chr  "2" "3" "3" "2" ...
 $ RACC_6A : chr  "2" "2" "2" "1" ...
 $ RACC_6B1: chr  "1" "2" "2" "1" ...
 $ RACC_6C : chr  "1" "1" "1" "1" ...
 $ RACC_6D : chr  "1" "2" "1" "1" ...
 $ RACC_6E : chr  "2" "2" "2" "1" ...
 $ RACC_6F : chr  "2" "1" "1" "1" ...
 $ RACC_6G : chr  "2" "2" "2" "2" ...
 $ RACCF6R : chr  "2" "2" "1" "1" ...
 $ ADMC_RNO: num  660337 104974 913583 245862 928899 ...
 $ ALCC_5A1: num  15.7 12.3 27.5 34.1 22.9 ...
 $ ALCC_5A2: num  12.99 41.72 2.23 16.19 41.72 ...
 $ ALCC_5A3: num  14.21 15.41 35.32 8.22 3.53 ...
 $ ALCC_5A4: num  32.52 17.32 6.43 35.39 19.99 ...
 $ ALCC_5A5: num  42.96 14.85 8.18 11.72 6.64 ...
 $ ALCC_5A6: num  35.2 20.7 16.4 42.4 11.7 ...
 $ ALCC_5A7: num  46.1 20.1 10.1 32.9 22.5 ...

First 5 rows:

id ALCC_1 ALCC_5 ALCCDTYP RACC_6A RACC_6B1 RACC_6C RACC_6D RACC_6E RACC_6F RACC_6G RACCF6R ADMC_RNO ALCC_5A1 ALCC_5A2 ALCC_5A3 ALCC_5A4 ALCC_5A5 ALCC_5A6 ALCC_5A7
1 2 31 2 2 1 1 1 2 2 2 2 660337.5 15.72437 12.99133 14.208899 32.515844 42.962386 35.19427 46.13716
2 2 27 3 2 2 1 2 2 1 2 2 104973.9 12.28077 41.71837 15.410777 17.316266 14.849557 20.70368 20.08061
3 2 18 3 2 2 1 1 2 1 2 1 913582.7 27.53722 2.23390 35.316357 6.431974 8.183292 16.37728 10.13340
4 1 35 2 1 1 1 1 1 1 2 1 245862.2 34.09194 16.19133 8.218619 35.393032 11.720900 42.36451 32.92027
5 2 26 3 1 1 1 1 1 2 2 2 928898.9 22.91900 41.71608 3.526694 19.988267 6.644390 11.69989 22.53624

Missing values: No missing values
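A missing-value check of this kind can be written in a few lines of base R. The sketch below runs on a small toy data frame that deliberately contains `NA` values, so the non-empty branch is exercised.

```r
# Toy data frame with some deliberate gaps
df_toy <- data.frame(
  id       = 1:4,
  RACC_6A  = c("1", "2", NA, "1"),
  ALCC_5A1 = c(12.5, NA, NA, 3.0),
  stringsAsFactors = FALSE
)

# Count NA values per column
na_counts <- colSums(is.na(df_toy))

if (all(na_counts == 0)) {
  message("No missing values")
} else {
  # Show only the columns that have at least one NA
  print(na_counts[na_counts > 0])
}
```

Applied to `df_mock`, the same check would take the first branch, consistent with the result reported above.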

Summary

This example demonstrated generating mock CCHS data for testing cchsflow harmonization workflows. The generated data:

  • Respects category ranges from variable_details
  • Contains no missing values in this run
  • Uses a fixed seed (12345) for reproducibility
  • Can be used to test harmonization functions before accessing real CCHS data

Next steps

  • Test your cchsflow harmonization pipeline on this mock data
  • Generate mock data for additional cycles as needed
  • Calculate derived variables after harmonization (not during mock data generation)