About this vignette: This tutorial teaches you how to generate realistic missing data in mock health survey datasets. You’ll learn how to add missing data codes commonly used in Canadian health surveys (like the CCHS and CHMS), including valid skip, don’t know, refusal, and not stated. All code examples run during vignette build to ensure accuracy.
Understanding missing data in health surveys
Health surveys use structured missing data codes to distinguish between different types of non-response. Unlike general R data where NA represents any missing value, survey data differentiates between several categories of missingness. This distinction is crucial for accurate statistical analysis.
Consider a simple example: calculating smoking prevalence from survey data. If you treat all missing values the same way, you’ll get biased estimates. Let’s see why.
# Generate smoking data with two approaches
set.seed(123)
# Wrong approach: treating all missing as NA
smoking_wrong <- data.frame(
  smoker = sample(c("Yes", "No", NA), 1000,
                  replace = TRUE,
                  prob = c(0.20, 0.70, 0.10))
)
# Correct approach: using survey missing codes
smoking_correct <- data.frame(
  smoker = sample(c("1", "2", "6", "7", "9"), 1000,
                  replace = TRUE,
                  prob = c(0.20, 0.70, 0.05, 0.03, 0.02))
)
Now let’s calculate prevalence both ways:
# Wrong calculation (naive approach)
wrong_prevalence <- mean(smoking_wrong$smoker == "Yes", na.rm = TRUE)
# Correct calculation (excluding valid skip, including DK/RF/NS in denominator)
valid_responses <- smoking_correct$smoker %in% c("1", "2", "7", "9")
correct_prevalence <- sum(smoking_correct$smoker == "1") / sum(valid_responses)
The naive approach gives us a prevalence of 22.4%, while the correct approach gives 22%. This difference matters when making population-level estimates or comparing across surveys.
Why missing data codes matter
In health surveys, not all missing values mean the same thing. Canadian health surveys like the Canadian Community Health Survey (CCHS) and Canadian Health Measures Survey (CHMS) use standardized systems of missing data codes:
- Valid skip (996): The question was not asked because skip logic determined it wasn’t applicable
- Don’t know (997): The question was asked but the respondent didn’t know the answer
- Refusal (998): The question was asked but the respondent refused to answer
- Not stated (999): The question was asked but no response was recorded
Each type has different statistical implications. Valid skips should be excluded from your denominator (they weren’t part of the eligible population for that question). But don’t know, refusal, and not stated should be included in the denominator when calculating response rates, even though they’re excluded from the numerator when calculating prevalence.
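The denominator rules above can be sketched in base R. This is a minimal illustration on a small made-up vector, not MockData code: the helper name `classify_code` and the example values are assumptions introduced here.

```r
# Hypothetical helper: label each value by the survey code it carries.
# Codes follow the 996-999 convention described above.
classify_code <- function(x) {
  out <- rep("valid", length(x))
  out[x == 996] <- "valid_skip"
  out[x == 997] <- "dont_know"
  out[x == 998] <- "refusal"
  out[x == 999] <- "not_stated"
  out
}

# Made-up responses: cigarettes per day, mixed with survey missing codes
cigs_per_day <- c(10, 20, 996, 996, 997, 5, 998, 999, 15, 996)
code_type <- classify_code(cigs_per_day)

# Valid skips leave the denominator; DK/refusal/not stated stay in it
asked <- code_type != "valid_skip"
answered <- code_type == "valid"

n_asked <- sum(asked)
response_rate <- sum(answered) / n_asked
```

Keeping the classification in a separate vector means the raw codes stay untouched, so the same data can feed both a response-rate calculation and a prevalence calculation.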
Let’s generate some realistic data to see this in action:
# Load MockData and the example metadata
library(MockData)
variable_details <- read.csv(
system.file("extdata/minimal-example/variable_details.csv",
package = "MockData"),
stringsAsFactors = FALSE
)
variables <- read.csv(
system.file("extdata/minimal-example/variables.csv",
package = "MockData"),
stringsAsFactors = FALSE
)
# Create smoking data using MockData
# smoking has categories: 1=Never, 2=Former, 3=Current, 7=Don't know
df_mock <- data.frame()
smoking_data <- create_cat_var(
var = "smoking",
databaseStart = "minimal-example",
variables = variables,
variable_details = variable_details,
df_mock = df_mock,
n = 1000,
seed = 456
)
# Show distribution
table(smoking_data$smoking)
  1   2   3   7 
466 309 195  30 
The three types of missing data codes
Health surveys categorize missing data into three main types, each requiring different statistical treatment.
Valid skip (code 996)
A valid skip occurs when skip logic determines a question should not be asked. For example, if someone reports they’ve never smoked, they won’t be asked “How many cigarettes per day do you smoke?” This isn’t truly missing data—it’s a logical consequence of their previous answer.
Statistical treatment: Exclude from both the numerator and denominator. These respondents weren’t eligible for the question.
Don’t know / Refusal / Not stated (codes 997, 998, 999)
These codes represent questions that were asked but didn’t receive valid responses:
- 997 (Don’t know): Respondent was uncertain about the answer
- 998 (Refusal): Respondent declined to answer
- 999 (Not stated): Question was asked but no response was recorded
Statistical treatment: Include in the denominator when calculating response rates (they were eligible and asked), but exclude from the numerator when calculating prevalence (we don’t know their true status).
Not applicable (code 996)
Not applicable shares code 996 with valid skip: the question doesn’t apply to the respondent’s situation, so the statistical treatment is the same as for a valid skip.
Let’s demonstrate the difference with a worked example using smoking status:
# Calculate response rate (includes DK in denominator, excludes any skip codes)
asked <- smoking_data$smoking %in% c("1", "2", "3", "7")
valid_response <- smoking_data$smoking %in% c("1", "2", "3")
response_rate <- sum(valid_response) / sum(asked)
# Calculate prevalence of current smoking (excludes DK from numerator, but includes in denominator)
current_smoker <- smoking_data$smoking == "3"
prevalence <- sum(current_smoker) / sum(asked)
In this dataset:
- Sample size: 1000 respondents
- Asked: 1000 (all respondents for this variable)
- Response rate: 97% (valid responses ÷ asked)
- Current smoking prevalence: 19.5% (current smokers ÷ asked)
- Don’t know responses: 30 (3% of those asked)
Notice how “Don’t know” (code 7) is included in the denominator when calculating response rate but excluded from the numerator when calculating current smoking prevalence.
Example with both skip and missing codes
Let’s look at the BMI variable, which demonstrates both valid skip (NA::a) and missing codes (NA::b):
# Look at BMI variable metadata
bmi_details <- variable_details[variable_details$variable == "BMI", ]
print(bmi_details[, c("variable", "recStart", "recEnd", "catLabel", "proportion")], row.names = FALSE)
 variable  recStart recEnd                        catLabel proportion
      BMI   [15,50]   copy                 Valid BMI range         NA
      BMI       996  NA::a                  Not applicable        0.3
      BMI [997,999]  NA::b Don't know, refusal, not stated        0.1
This shows how missing codes appear in survey metadata. The recStart column contains the category codes, including continuous range ([15,50]), valid skip (996 for not applicable, mapped to NA::a), and missing data codes ([997,999] for don’t know/refusal/not stated, mapped to NA::b).
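To make the recStart notation concrete, here is a minimal base R sketch that turns values such as "996" or "[997,999]" into inclusive numeric bounds. The function name `parse_rec_start` is hypothetical and is not part of MockData; the package may parse these values differently.

```r
# Hypothetical parser for recStart values such as "996" or "[997,999]".
# Returns a matrix with one row per entry and inclusive lower/upper bounds.
parse_rec_start <- function(rec_start) {
  cleaned <- gsub("\\[|\\]|\\s", "", rec_start)  # drop brackets and spaces
  parts <- strsplit(cleaned, ",", fixed = TRUE)
  t(vapply(parts, function(p) {
    values <- as.numeric(p)
    # A single code like "996" becomes the degenerate range 996..996
    c(lower = values[1], upper = values[length(values)])
  }, c(lower = 0, upper = 0)))
}

rec_start <- c("[15,50]", "996", "[997,999]")
bounds <- parse_rec_start(rec_start)

# Find which metadata row a raw value falls into
value <- 998
matching_row <- which(value >= bounds[, "lower"] & value <= bounds[, "upper"])
```

With the three BMI rows above, the value 998 falls in the third row, the one mapped to NA::b.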
We can use this metadata to generate mock data with realistic proportions:
# Generate age data with missing codes
df_mock_age <- data.frame()
age_data <- create_con_var(
var = "age",
databaseStart = "minimal-example",
variables = variables,
variable_details = variable_details,
df_mock = df_mock_age,
n = 5000,
seed = 789
)
# Calculate response rate: valid ages over everyone asked (valid ages plus DK/refusal/not stated)
valid_ages <- !is.na(age_data$age) & age_data$age >= 18 & age_data$age <= 100
missing_codes <- age_data$age %in% c(997, 998, 999)
asked_age <- valid_ages | missing_codes
response_rate_age <- sum(valid_ages) / sum(asked_age)
# Summary statistics on valid ages only
mean_age <- mean(age_data$age[valid_ages], na.rm = TRUE)
For this age variable:
- Sample size: 5000 respondents
- Response rate: 90% (4500 valid responses ÷ 5000 asked)
- Mean age (valid responses only): 50.1 years
- Don’t know responses: 247 (4.9% of those asked)
- Refusal: 146 (2.9% of those asked)
- Not stated: 107 (2.1% of those asked)
This demonstrates how health survey data includes measurable proportions of missing data codes, and why distinguishing between them matters for accurate statistical reporting.
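A per-code breakdown like the one listed above can be produced with `table()`. The sketch below uses a small made-up vector rather than the MockData output, so its counts are illustrative only; the object names are assumptions.

```r
# Made-up age values with survey missing codes (not MockData output)
age <- c(34, 51, 997, 68, 998, 45, 999, 997, 29, 72)

missing_labels <- c("997" = "Don't know", "998" = "Refusal", "999" = "Not stated")
is_missing_code <- age %in% c(997, 998, 999)

# Count each missing code separately, keeping codes with zero counts visible
missing_counts <- table(factor(age[is_missing_code],
                               levels = names(missing_labels),
                               labels = missing_labels))

# Share of each code among everyone who was asked
missing_share <- missing_counts / length(age)
```

Using a factor with fixed levels keeps all three codes in the table even when one of them does not occur in the data.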
The recEnd column: Enabling missing data classification
To generate realistic missing data, MockData needs to distinguish between valid response codes (1, 2, 3) and missing data codes (6-9, 96-99). The recEnd column in variable_details.csv provides this classification. We follow the recodeflow approach, which supports tagged NA values through haven and other packages. recodeflow can add tagged NA values when recoding and transforming variables. NA::a is assigned to ‘not applicable’ and NA::b is assigned to ‘missing/refused/not stated’.
How recEnd works
The recEnd column maps input codes to their classification:
Valid responses: Map codes to themselves using numeric values
# variable_details.csv
uid,uid_detail,variable,recStart,recEnd,catLabel,proportion
v_002,d_002,smoking,1,1,Daily smoker,0.25
v_002,d_003,smoking,2,2,Occasional,0.15
v_002,d_004,smoking,3,3,Never,0.57
Missing codes: Use NA::b for don’t know, refusal, not stated
# variable_details.csv (continued)
v_002,d_005,smoking,7,NA::b,Don't know,0.03
Skip codes: Use NA::a for valid skip/not applicable
v_002,d_006,smoking,6,NA::a,Valid skip,0.01
Why recEnd is required
Without recEnd, MockData cannot tell the difference between:
- Code 1 (valid: Daily smoker)
- Code 7 (missing: Don’t know)
Both are just numbers in recStart. The recEnd column provides explicit classification:
- recEnd = "1" → Valid response code
- recEnd = "NA::b" → Missing data (DK/Refusal/NS)
- recEnd = "NA::a" → Skip code (not applicable)
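The three-way classification above can be expressed as a small lookup. This is a sketch in base R under the assumption that recEnd holds plain strings; `classify_rec_end` is a hypothetical name, not a MockData function.

```r
# Hypothetical helper: classify recEnd values following the NA:: convention
classify_rec_end <- function(rec_end) {
  ifelse(rec_end == "NA::a", "skip",
         ifelse(rec_end == "NA::b", "missing", "valid"))
}

rec_start <- c("1", "2", "3", "7", "6")
rec_end <- c("1", "2", "3", "NA::b", "NA::a")

# Pair each input code with its classification
code_classes <- setNames(classify_rec_end(rec_end), rec_start)
```

Here codes 1 and 7 look alike in recStart but end up in different classes, which is exactly the distinction recEnd exists to carry.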
Conditional requirement
The recEnd column is conditionally required when your variable_details.csv contains missing data codes (6-9, 96-99):
Validation error: If you use codes like 7, 8, or 9 in recStart without a recEnd column, you’ll get this error:
recEnd column required in variable_details when using missing data codes (6-9, 96-99).
Use 'NA::a' for skip codes (6, 96, 996),
Use 'NA::b' for missing codes (7-9, 97-99) representing DK/Refusal/NS,
and numeric codes (e.g., '1', '2', '3') for valid responses.
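A check of this kind could look roughly like the following base R sketch. It is an assumption about the logic, not MockData's actual implementation: the function name `check_rec_end` is hypothetical and the error text is shortened.

```r
# Hypothetical validation sketch; MockData's actual check may differ.
check_rec_end <- function(variable_details) {
  missing_code_values <- as.character(c(6:9, 96:99))
  uses_missing_codes <- any(variable_details$recStart %in% missing_code_values)
  has_rec_end <- "recEnd" %in% names(variable_details)
  if (uses_missing_codes && !has_rec_end) {
    stop("recEnd column required in variable_details when using missing data codes (6-9, 96-99).")
  }
  invisible(TRUE)
}

# Code 7 appears in recStart, but there is no recEnd column
details_without_rec_end <- data.frame(
  variable = "smoking",
  recStart = c("1", "2", "3", "7"),
  stringsAsFactors = FALSE
)

# Capture the error message instead of stopping the session
result <- tryCatch(
  check_rec_end(details_without_rec_end),
  error = function(e) conditionMessage(e)
)
```

Adding a recEnd column to the same data frame makes the check pass, and so does removing code 7 from recStart.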
When recEnd is optional: For simple variables without missing codes, recEnd can be omitted:
# Simple categorical variable (no missing codes) - recEnd optional
uid,uid_detail,variable,recStart,catLabel,proportion
v_010,d_100,gender,1,Male,0.5
v_010,d_101,gender,2,Female,0.5
Complete example with recEnd
# Create variable_details with recEnd column
smoking_details_recEnd <- data.frame(
uid = "cchsflow_v0002",
uid_detail = c("cchsflow_d00005", "cchsflow_d00006", "cchsflow_d00007", "cchsflow_d00008"),
variable = "smoking",
recStart = c("1", "2", "3", "7"),
recEnd = c("1", "2", "3", "NA::b"), # Explicit classification
catLabel = c("Never smoker", "Former smoker", "Current smoker", "Don't know"),
proportion = c(0.5, 0.3, 0.17, 0.03),
stringsAsFactors = FALSE
)
# Show the configuration
print(smoking_details_recEnd, row.names = FALSE)
            uid      uid_detail variable recStart recEnd       catLabel proportion
 cchsflow_v0002 cchsflow_d00005  smoking        1      1   Never smoker       0.50
 cchsflow_v0002 cchsflow_d00006  smoking        2      2  Former smoker       0.30
 cchsflow_v0002 cchsflow_d00007  smoking        3      3 Current smoker       0.17
 cchsflow_v0002 cchsflow_d00008  smoking        7  NA::b     Don't know       0.03
This configuration tells MockData:
- Codes 1, 2, 3 are valid responses (recEnd maps to themselves)
- Code 7 is a missing code (recEnd = “NA::b”)
- When generating data, 3% will be “Don’t know” responses
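To illustrate what such a configuration implies, the sketch below draws codes with base R's `sample()` and then recodes them through recEnd. It approximates the idea only and is not how MockData generates data internally; the seed and object names are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed so the draw is reproducible

details <- data.frame(
  recStart = c("1", "2", "3", "7"),
  recEnd = c("1", "2", "3", "NA::b"),
  proportion = c(0.5, 0.3, 0.17, 0.03),
  stringsAsFactors = FALSE
)

# Draw raw survey codes with the configured proportions
raw <- sample(details$recStart, size = 1000, replace = TRUE,
              prob = details$proportion)

# Recode for analysis: valid codes map to themselves, NA:: entries become NA
rec_end <- details$recEnd[match(raw, details$recStart)]
recoded <- ifelse(startsWith(rec_end, "NA::"), NA_character_, rec_end)

# Compare observed shares with the configured proportions
observed <- prop.table(table(raw))
```

The observed shares will vary around the configured proportions from draw to draw, and every raw code 7 ends up as NA in the recoded vector.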
NA:: conventions
MockData uses recodeflow’s NA:: convention for missing data classification:
NA::a (skip codes):
- Question not asked due to skip logic
- Example: “Have you smoked in the last 30 days?” is skipped for never smokers
- Statistical treatment: Exclude from denominator (not eligible)
- Common codes: 6, 96, 996
NA::b (missing codes):
- Question asked but no valid response given
- Includes: Don’t know (7, 97), Refusal (8, 98), Not stated (9, 99)
- Statistical treatment: Include in denominator for response rates
- Common codes: 7-9, 97-99
This convention enables automatic missing data handling throughout MockData’s generation functions.