NA
–> tagged_na
)
Base R supports only one type of NA (‘not available’) to represent
missing values. The CCHS, however, has several types of missing data or
incomplete category responses. cchsflow
uses the
haven
package tagged_na()
to allow multiple missing data categories. tagged_na()
adds
an addition character to the NA value, thereby allowing users to define
additional missing data types. tagged_na()
applies only for
numeric values, as character base values can use any string to represent
NA or missing data.
CCHS category values can change across variables and survey cycles, but the most common values are:
CCHS category value | Label |
---|---|
6 | not applicable |
7 | don’t know |
8 | refusal |
9 | not stated |
cchsflow
recodes the four CCHS category missing data
categories values into two NA values that are commonly used for most
studies. NA(a) = ‘not applicable’ and NA(b) = ‘missing’. Furthermore,
variables may be entirely missing from a specific cycle, which results
in ‘not asked’ missing data that is not included or coded in CCHS
surveys, recoded to NA(a)
Summary of tagged_na
values and their
corresponding CCHS category values.
cchsflow tagged_na
|
CCHS category value | Label |
---|---|---|
NA(a) | 6 | not applicable |
NA(b) | 7 | don’t know |
NA(b) | 8 | refusal |
NA(b) | 9 | not stated |
NA(c) | question not asked in the survey cycle |
haven::tagged_na()
library(haven)
x <- c(1:5, tagged_na("a"), tagged_na("b"))
# Is used to read the tagged NA in most other functions they are still viewed as NA
na_tag(x)
## [1] NA NA NA NA NA "a" "b"
# Is used to print the na as well as their tag
print_tagged_na(x)
## [1] 1 2 3 4 5 NA(a) NA(b)
# Tagged NA's work identically to regular NAs
x
## [1] 1 2 3 4 5 NA NA
is.na(x)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE
See the derived variables and how to add variables for details on derived variables and how to add them.
When creating derived variables from CCHS variables, it is important
to distinguish the different NA values. Certain derived variables
include the use of variables that may not be applicable to respondents.
For example, smoking
pack-years involves the use of smoking variables that may not be
applicable to all CCHS respondents (i.e. non-smokers who have never
smoked). In this scenario, respondents who had values of
NA(a)
for the various base smoking variables would have a
value of NA(a)
for smoking pack-years. Respondents who had
identified as having smoked at one point in their lives, but had values
of NA(b)
in the base smoking variables would have a value
of NA(b)
for smoking pack-years as pack-years cannot be
calculated due to missing values.
On the other hand, there are certain derived variables that use
variables that are applicable to all respondents. For example, BMI
uses CCHS height and weight variables which are asked to all CCHS
respondents. In this scenario, all NA variables would be tagged as
NA(b)
as these variables are applicable to everyone, and
respondents with NA values would be classified as missing.
When creating deriving variables, it is important to examine the base
CCHS variables to check for the presence of NA(a)
and
NA(b)
. This can be done by reviewing the CCHS documentation.