Missing data (NA -> tagged_na)

Base R supports only one type of NA (‘not available’) to represent missing values. The CCHS, however, has several types of missing data or incomplete category responses. cchsflow uses tagged_na() from the haven package to allow multiple missing data categories. tagged_na() attaches an additional character (a tag) to the NA value, thereby allowing users to define additional missing data types. tagged_na() applies only to numeric values, because character-based values can use any string to represent NA or missing data.

CCHS approach to coding missing data

CCHS category values can change across variables and survey cycles, but the most common values are:

CCHS category value    Label
6                      not applicable
7                      don’t know
8                      refusal
9                      not stated

cchsflow recodes the four CCHS missing data category values into two tagged NA values that are commonly used in most studies: NA(a) = ‘not applicable’ and NA(b) = ‘missing’. In addition, a variable may be entirely absent from a specific cycle. This ‘not asked’ missing data is not included or coded in the CCHS surveys, and cchsflow recodes it to NA(c).

Summary of tagged_na values and their corresponding CCHS category values.

cchsflow tagged_na    CCHS category value    Label
NA(a)                 6                      not applicable
NA(b)                 7                      don’t know
NA(b)                 8                      refusal
NA(b)                 9                      not stated
NA(c)                                        question not asked in the survey cycle
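The mapping in the table can be sketched in a few lines of R. This is only an illustration: recode_cchs_missing() is a hypothetical helper, not a cchsflow function, and it assumes a variable whose missing data codes are exactly 6 to 9. Many CCHS variables use other codes (or use 6 as a valid response), so the codes should be checked per variable.

```r
library(haven)

# Hypothetical helper (not part of cchsflow): map the common CCHS
# missing data codes onto tagged NA values.
recode_cchs_missing <- function(x) {
  out <- as.numeric(x)
  out[x %in% 6] <- tagged_na("a")           # not applicable
  out[x %in% c(7, 8, 9)] <- tagged_na("b")  # don't know, refusal, not stated
  out
}

raw <- c(1, 2, 6, 7, 8, 9)
recoded <- recode_cchs_missing(raw)

# Show the values together with their tags
print_tagged_na(recoded)
```

Using %in% rather than == keeps the subsetting safe if the input already contains NA values.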

Example haven::tagged_na()

library(haven)

x <- c(1:5, tagged_na("a"), tagged_na("b"))

# na_tag() reads the tag of each value; most other functions still treat tagged NAs as plain NA
na_tag(x)
## [1] NA  NA  NA  NA  NA  "a" "b"
# print_tagged_na() prints the NAs together with their tags
print_tagged_na(x)
## [1]     1     2     3     4     5 NA(a) NA(b)
# Tagged NAs work identically to regular NAs
x
## [1]  1  2  3  4  5 NA NA
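As a small complement, the sketch below shows how a single tag can be picked out with haven's is_tagged_na(), while base R functions keep treating every tagged value as an ordinary NA. The vector x is the same illustrative vector as in the example.

```r
library(haven)

x <- c(1:5, tagged_na("a"), tagged_na("b"))

# Base R sees both tagged values as missing
is.na(x)

# is_tagged_na() with a tag argument flags only the values carrying that tag
is_tagged_na(x, "a")

# Tagged NAs are dropped like any other NA
mean(x, na.rm = TRUE)
```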

Using tagged_na for creating derived variables

See the documentation on derived variables and on how to add variables for details.

When creating derived variables from CCHS variables, it is important to distinguish between the different NA values. Some derived variables rely on base variables that are not applicable to every respondent. For example, smoking pack-years is calculated from smoking variables that do not apply to all CCHS respondents (for example, respondents who have never smoked). In this scenario, respondents with NA(a) for the various base smoking variables would have a value of NA(a) for smoking pack-years. Respondents who identified as having smoked at some point in their lives, but who have NA(b) in the base smoking variables, would have a value of NA(b) for smoking pack-years, because pack-years cannot be calculated from missing values.
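The tag handling described here might look like the following sketch. pack_years_sketch() and its two inputs are hypothetical and heavily simplified; cchsflow's own pack-years calculation uses more base variables. The rule that NA(a) is returned only when both inputs are NA(a) is an assumption made for illustration.

```r
library(haven)

# Hypothetical, simplified pack-years calculation (not cchsflow's own function).
# One pack is assumed to be 20 cigarettes.
pack_years_sketch <- function(cigs_per_day, years_smoked) {
  result <- (cigs_per_day / 20) * years_smoked

  # Any missing input means pack-years cannot be calculated: tag as 'missing'
  result[is.na(cigs_per_day) | is.na(years_smoked)] <- tagged_na("b")

  # Assumption: if both inputs are 'not applicable', the result is too
  not_applicable <- is_tagged_na(cigs_per_day, "a") &
    is_tagged_na(years_smoked, "a")
  result[not_applicable] <- tagged_na("a")

  result
}

cigs  <- c(20, tagged_na("a"), tagged_na("b"), 10)
years <- c(10, tagged_na("a"), 15, 30)
pack_years <- pack_years_sketch(cigs, years)
print_tagged_na(pack_years)
```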

On the other hand, some derived variables use base variables that are applicable to all respondents. For example, BMI uses the CCHS height and weight variables, which are asked of all CCHS respondents. In this scenario, all NA values would be tagged as NA(b), because these variables apply to everyone and respondents with NA values are classified as missing.
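A minimal sketch of this case follows. bmi_sketch() is a hypothetical function, not cchsflow's BMI function, and it assumes height in metres and weight in kilograms. Every NA, tagged or not, ends up as NA(b).

```r
library(haven)

# Hypothetical, simplified BMI calculation (not cchsflow's own function).
# Assumed units: height in metres, weight in kilograms.
bmi_sketch <- function(height_m, weight_kg) {
  bmi <- weight_kg / height_m^2
  # Height and weight apply to everyone, so any NA is tagged as 'missing'
  bmi[is.na(bmi)] <- tagged_na("b")
  bmi
}

height <- c(1.75, tagged_na("b"), 1.60)
weight <- c(70, 80, NA)
bmi <- bmi_sketch(height, weight)
print_tagged_na(bmi)
```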

When creating derived variables, it is important to examine the base CCHS variables to check for the presence of NA(a) and NA(b). This can be done by reviewing the CCHS documentation.
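Once a variable has been recoded, the tags can also be tallied directly as a complement to the documentation review. The sketch below uses a made-up vector, smoking_var, standing in for a recoded base variable.

```r
library(haven)

# Hypothetical recoded variable; real data would come from a transformed CCHS cycle
smoking_var <- c(1, 2, tagged_na("a"), tagged_na("b"), tagged_na("b"), 3)

# Count each NA tag; untagged values have an NA tag and are dropped by table()
tag_counts <- table(na_tag(smoking_var))
tag_counts
```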