Base R supports only one type of NA (‘not available’) to represent missing values. The CCHS, however, has several types of missing data or incomplete category responses.
cchsflow uses tagged_na() from the haven package to allow multiple missing data categories.
tagged_na() adds an additional character (a tag) to the NA value, thereby allowing users to define additional missing data types.
tagged_na() applies only to numeric values, because character-based values can use any string to represent NA or missing data.
CCHS category values can change across variables and survey cycles, but the most common values are:
|CCHS category value||Label|
cchsflow recodes the four CCHS missing data category values into two NA values that are commonly used in most studies: NA(a) = ‘not applicable’ and NA(b) = ‘missing’. In addition, a variable may be entirely absent from a specific cycle. This ‘not asked’ missing data is not included or coded in the CCHS surveys themselves, and is recoded to NA(c).
tagged_na values and their corresponding CCHS category values.
||CCHS category value||Label|
|NA(c)||question not asked in the survey cycle|
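library(haven)
x <- c(1:5, tagged_na("a"), tagged_na("b"))
# na_tag() returns the tag of each tagged NA (and NA for all other values)
na_tag(x)
## [1] NA NA NA NA NA "a" "b"
# Is used to print the NAs as well as their tags
print_tagged_na(x)
## [1] 1 2 3 4 5 NA(a) NA(b)
# Tagged NAs work identically to regular NAs
x
## [1] 1 2 3 4 5 NA NA
is.na(x)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE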
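The recoding of CCHS category values into tagged NAs described earlier can be sketched with haven directly. This is only an illustration, not how cchsflow implements recoding internally: the helper name recode_cchs_missing() is hypothetical, and the codes 6 to 9 are assumed for the example, since CCHS category values differ across variables and survey cycles.

```r
library(haven)

# Hypothetical helper: map assumed CCHS missing-data codes to tagged NAs.
# The default codes (6 = not applicable; 7, 8, 9 = missing) are assumptions
# for illustration; check the CCHS documentation for each variable.
recode_cchs_missing <- function(x, not_applicable = 6, missing = c(7, 8, 9)) {
  out <- as.numeric(x)                          # tagged NAs need a double vector
  out[x %in% not_applicable] <- tagged_na("a")  # 'not applicable'
  out[x %in% missing] <- tagged_na("b")         # 'missing'
  out
}

raw <- c(1, 2, 6, 7, 9, 3)
recoded <- recode_cchs_missing(raw)
print_tagged_na(recoded)
```

Assigning the tags by index keeps the result numeric, so the recoded vector still behaves like ordinary NA data in functions that do not know about tags.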
When creating derived variables from CCHS variables, it is important to distinguish between the different NA values. Some derived variables rely on variables that may not be applicable to every respondent. For example, smoking pack-years is calculated from smoking variables that do not apply to all CCHS respondents (i.e. respondents who have never smoked). In this scenario, respondents with values of NA(a) for the base smoking variables would have a value of NA(a) for smoking pack-years. Respondents who reported having smoked at some point in their lives, but who have values of NA(b) in the base smoking variables, would have a value of NA(b) for smoking pack-years, because pack-years cannot be calculated from missing values.
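The pack-years rule above can be sketched as a small function. The function name, the argument names and the simplified formula (packs per day multiplied by years smoked) are assumptions made for this example and are not the cchsflow derivation. The sketch also assumes that NA(a) takes precedence when a respondent has a mix of tags.

```r
library(haven)

# Hypothetical, simplified pack-years: (cigarettes per day / 20) * years smoked.
# Names and formula are illustrative assumptions only.
pack_years_sketch <- function(cigs_per_day, years_smoked) {
  out <- (cigs_per_day / 20) * years_smoked

  any_na   <- is.na(cigs_per_day) | is.na(years_smoked)
  any_na_a <- is_tagged_na(cigs_per_day, "a") | is_tagged_na(years_smoked, "a")

  # Assign tags explicitly instead of relying on arithmetic to carry them over
  out[any_na]   <- tagged_na("b")  # input missing, so pack-years is missing
  out[any_na_a] <- tagged_na("a")  # input not applicable (assumed precedence)
  out
}

cigs  <- c(20, tagged_na("a"), tagged_na("b"), 10)
years <- c(10, tagged_na("a"), 15, tagged_na("b"))
pack_years <- pack_years_sketch(cigs, years)
print_tagged_na(pack_years)
```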
On the other hand, some derived variables use variables that are applicable to all respondents. For example, BMI uses the CCHS height and weight variables, which are asked of all CCHS respondents. In this scenario, all NA values would be tagged as NA(b): because these variables apply to everyone, respondents with NA values are classified as missing.
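A minimal sketch of the BMI case, assuming height in metres and weight in kilograms. The function and argument names are hypothetical and do not correspond to cchsflow's own BMI function.

```r
library(haven)

# Hypothetical BMI sketch: weight (kg) / height (m)^2.
# Height and weight apply to all respondents, so every NA becomes NA(b).
bmi_sketch <- function(height_m, weight_kg) {
  out <- weight_kg / height_m^2
  out[is.na(height_m) | is.na(weight_kg)] <- tagged_na("b")
  out
}

height <- c(1.75, tagged_na("b"), 1.60)
weight <- c(70, 80, NA)
bmi <- bmi_sketch(height, weight)
print_tagged_na(bmi)
```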
When creating derived variables, it is important to examine the base CCHS variables to check for the presence of NA(b). This can be done by reviewing the CCHS documentation.
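If a base variable has already been recoded to tagged NAs, the tags can also be inspected directly as a complement to the documentation. The vector below is made-up example data, not a real CCHS variable.

```r
library(haven)

# Hypothetical base variable that has already been recoded to tagged NAs
smoking_var <- c(1, 2, tagged_na("a"), tagged_na("b"), tagged_na("b"), 3)

# Count how many values carry each tag (non-missing values are dropped)
tag_counts <- table(na_tag(smoking_var))
tag_counts

# Check whether a specific tag is present
has_na_b <- any(is_tagged_na(smoking_var, "b"))
has_na_b
```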