variable_details worksheet

Introduction

The worksheet variable_details maps variables across datasets. variable_details also contains the instructions for function rec_with_table on how to recode the variables listed in the worksheet variables.

Specifically, the function rec_with_table recodes variables listed in column variable_details$variableStart to the variables listed in variable_details$variable.

#> In the dataframe `variable_details` there are 119 rows and 16 columns

Structure of the worksheet `variable_details`

Naming convention for not applicable and missing values

rec_with_table() uses the tagged_na() function from the haven package to tag not applicable responses as NA(a), and missing values (don’t know, refusal, not stated) as NA(b). As you will see later, not applicable values are recoded to NA::a, and missing values are recoded to NA::b. See tagged_na_usage for more information about tagged_na().

Rows

Each row in variable_details gives instructions to recode a single category of a final variable. Typically, a single category of a final variable maps to a single category of an original variable. Therefore, in variable_details you’ll have one row per category.

There are two exceptions to this rule.

Original categories are sometimes combined into a single final variable category. For example,

Most often survey responses “don’t know,” “refusal,” and “not stated” are combined into a final single “missing” category.
Original age group categories are often combined into larger age groups categories (e.g., original five-year age group categories are combined into final ten-year age group categories).

Original datasets with different categories map to separate rows for a final variable. For example,

The final variable, “Sex,” has categories “1” = male and “2” = female.
Dataset A has sex categories “m” = male and “f” = female.
Dataset B has categories “1” = male and “2” = female.
Variable_details requires four rows to map the final “Sex” categories = two for dataset A and two for dataset B.

Note: We recommend not combining variables if the variable has an important change between datasets. variable_detailsdetails$notes is used to identify issues that may be relevant when recoding the variable or category.

Columns

Here we explain the columns in the worksheet variable_details.

Table 1. Summary of the source and level of information used for each column in the worksheet ‘variable_details’.

	Level of information
Data_source	Dataset	Variable	Categories
recoded		variable *	recEnd *
		typeEnd *	numValidCat
			catLabel
			catLongLabel
			units
original	databaseStart	variableStart *	recStart *
		typeStart *	catStartLabel
		variableStartlabel
both	notes	notes	notes
^* Columns required for the function rec_with_table. All other columns are recommended but optional.

The order of the columns within the worksheet variable_details is not important because recodeflow functions use the column names.

variable: the name of the final recoded variable.

variable
alk.phos
alk.phos
ascites
ascites
ascites
ascites
alk.phos

typeEnd: the data type of the final recoded variable. If the recoded variable is continuous fill in as cont; if the recoded variable that is categorical fill in as cat .

variable	dummyVariable
alk.phos	NA
alk.phos	NA
ascites	ascites_cat3_1
ascites	ascites_cat3_2
ascites	ascites_cat3_NA::b
ascites	ascites_cat3_NA::a
alk.phos	NA

typeStart: the variable type as indicated in datasets. As indicated in the typeEnd column, categorical variables are denoted as cat and continuous variables are denoted as cont.

variable	dummyVariable	typeEnd
alk.phos	NA	cont
alk.phos	NA	cont
ascites	ascites_cat3_1	cat
ascites	ascites_cat3_2	cat
ascites	ascites_cat3_NA::b	cat
ascites	ascites_cat3_NA::a	cat
alk.phos	NA	cont

databaseStart: the datasets that contain the variable and categories of interest, separated by commas.

variable	dummyVariable	typeEnd	typeStart
alk.phos	NA	cont	cont
alk.phos	NA	cont	cont
ascites	ascites_cat3_1	cat	cat
ascites	ascites_cat3_2	cat	cat
ascites	ascites_cat3_NA::b	cat	cat
ascites	ascites_cat3_NA::a	cat	cat
alk.phos	NA	cont	cont

variableStart: the original names of the variables as they are listed in the original datasets, separated by commas.

If the variable name in a particular dataset is different from the recoded variable name, write out the dataset name, add two colons, and write out the original variable name for that dataset.
If the variable name in a particular dataset is the same as the recoded variable name, write the variable name in square brackets. Note: this only needs to be written out once.

variable	dummyVariable	typeEnd	typeStart	databaseStart
alk.phos	NA	cont	cont	tester1; tester2
alk.phos	NA	cont	cont	tester1; tester2
ascites	ascites_cat3_1	cat	cat	tester1; tester2
ascites	ascites_cat3_2	cat	cat	tester1; tester2
ascites	ascites_cat3_NA::b	cat	cat	tester1; tester2
ascites	ascites_cat3_NA::a	cat	cat	tester1; tester2
alk.phos	NA	cont	cont	tester1; tester2

variableStartLabel: short form label describing the original variable.

variable	dummyVariable	typeEnd	typeStart	databaseStart	variableStart
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]
ascites	ascites_cat3_1	cat	cat	tester1; tester2	[ascites]
ascites	ascites_cat3_2	cat	cat	tester1; tester2	[ascites]
ascites	ascites_cat3_NA::b	cat	cat	tester1; tester2	[ascites]
ascites	ascites_cat3_NA::a	cat	cat	tester1; tester2	[ascites]
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]

numValidCat: the number of categories for the recoded variable. This only applies to variables in which the typeEnd is cat. For continuous variables, numValidCat = N/A. Note that for categories: not applicable, missing, and else, are not included in the category count. This column is not necessary for the function rec_with_table().

variable	dummyVariable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos
ascites	ascites_cat3_1	cat	cat	tester1; tester2	[ascites]	ascites
ascites	ascites_cat3_2	cat	cat	tester1; tester2	[ascites]	ascites
ascites	ascites_cat3_NA::b	cat	cat	tester1; tester2	[ascites]	ascites
ascites	ascites_cat3_NA::a	cat	cat	tester1; tester2	[ascites]	ascites
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos

recEnd: the category of the categorical value your recoding to. For the not applicable category, write NA::a and for missing and else categories, write NA::b. For continuous variables that are not recoded in type, you would write in this column copy so that the function copies the values without any recoding.

variable	dummyVariable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA
ascites	ascites_cat3_1	cat	cat	tester1; tester2	[ascites]	ascites	2
ascites	ascites_cat3_2	cat	cat	tester1; tester2	[ascites]	ascites	2
ascites	ascites_cat3_NA::b	cat	cat	tester1; tester2	[ascites]	ascites	2
ascites	ascites_cat3_NA::a	cat	cat	tester1; tester2	[ascites]	ascites	2
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA

For categorical variables that remain categorical (i.e. from: cat -> to: cat), it is ideal to retain the same values as indicated in the original datasets.
For categorical variables that have changed in type (i.e from: cat -> to: cont), you will have to develop values that make the most sense to your analysis. For example, for variables that change from: cat -> to: cont, you can choose the categories’ lower bound, mid-point, or upper-bound for the continuous value.

catLabel: short form label describing the category of a particular variable.

variable	dummyVariable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	copy
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	NA::b
ascites	ascites_cat3_1	cat	cat	tester1; tester2	[ascites]	ascites	2	0
ascites	ascites_cat3_2	cat	cat	tester1; tester2	[ascites]	ascites	2	1
ascites	ascites_cat3_NA::b	cat	cat	tester1; tester2	[ascites]	ascites	2	NA::b
ascites	ascites_cat3_NA::a	cat	cat	tester1; tester2	[ascites]	ascites	2	NA::a
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	NA::a

catLabelLong: more detailed label describing the category of a particular variable. If the category is unchanged from original data, this label should be identical to what is recorded in the original data’s documentation (e.g., the exact survey question). For derived variables, you can create your own long label.

variable	dummyVariable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd	catLabel
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	copy	NA
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	NA::b	missing
ascites	ascites_cat3_1	cat	cat	tester1; tester2	[ascites]	ascites	2	0	ascites 0
ascites	ascites_cat3_2	cat	cat	tester1; tester2	[ascites]	ascites	2	1	ascites 1
ascites	ascites_cat3_NA::b	cat	cat	tester1; tester2	[ascites]	ascites	2	NA::b	missing
ascites	ascites_cat3_NA::a	cat	cat	tester1; tester2	[ascites]	ascites	2	NA::a	not applicable
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	NA::a	not applicable

units: the units of a particular variable. If there are no units for the variable, write N/A. Note, the function will not work if there different units between the rows of the same variable (e.g. if the concentration of a single molecule has two records one recorded as mg/dl and the other recorded as g/dl).

variable	dummyVariable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd	catLabel	catLabelLong
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	copy	NA	NA
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	NA::b	missing	missing
ascites	ascites_cat3_1	cat	cat	tester1; tester2	[ascites]	ascites	2	0	ascites 0	ascites 0
ascites	ascites_cat3_2	cat	cat	tester1; tester2	[ascites]	ascites	2	1	ascites 1	ascites 1
ascites	ascites_cat3_NA::b	cat	cat	tester1; tester2	[ascites]	ascites	2	NA::b	missing	missing
ascites	ascites_cat3_NA::a	cat	cat	tester1; tester2	[ascites]	ascites	2	NA::a	not applicable	not applicable
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	NA::a	not applicable	not applicable

recStart: the range of values for a particular category in a variable as indicated in the original data. See data documentation and use the smallest and large values as your range to capture all values across the datasets.

The rules for each category of a new variable are a string in recStart and value in recEnd. These recode pairs are the same syntax as interval notation in which a closed range of values are specified using square brackets. See Notations for intervals for more information on interval notation. Recode pairs are obtained from the recStart and recEnd columns value range is indicated by a comma, e.g. recStart= [1,4]; recEnd = 1 (recodes all values from 1 to 4 into 1} value range for double vectors (with fractional part), all values within the specified range are recoded; e.g. recStart = [1,2.5]; recEnd = 1 recodes 1 to 2.5 into 1, but 2.55 would not be recoded (since it’s not included in the specified range). NA is used for missing values (don’t know, refusal, not stated) else is used all other values, which have not been specified yet, are indicated by else, e.g. recStart = "else"; recEnd = NA (recode all other values (not specified in other rows) to “NA”)} copy the else token can be combined with copy, indicating that all remaining, not yet recoded values should stay the same (are copied from the original value), e.g. recStart = "else"; recEnd = "copy"

variable	dummyVariable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	copy
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	NA::b
ascites	ascites_cat3_1	cat	cat	tester1; tester2	[ascites]	ascites	2	0
ascites	ascites_cat3_2	cat	cat	tester1; tester2	[ascites]	ascites	2	1
ascites	ascites_cat3_NA::b	cat	cat	tester1; tester2	[ascites]	ascites	2	NA::b
ascites	ascites_cat3_NA::a	cat	cat	tester1; tester2	[ascites]	ascites	2	NA::a
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	NA::a

catStartLabel: label describing each category. This label should be identical to what is shown in the original data documentation. For the missing rows, each missing category is described along with their coded values. You can import labels from a survey’s DDI files using bllflow helper functions. See bllflow documentation.

variable	dummyVariable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd	catLabel	catLabelLong	units	recStart
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	copy	NA	NA	NA	[200,15000]
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	NA::b	missing	missing	NA	else
ascites	ascites_cat3_1	cat	cat	tester1; tester2	[ascites]	ascites	2	0	ascites 0	ascites 0	NA	0
ascites	ascites_cat3_2	cat	cat	tester1; tester2	[ascites]	ascites	2	1	ascites 1	ascites 1	NA	1
ascites	ascites_cat3_NA::b	cat	cat	tester1; tester2	[ascites]	ascites	2	NA::b	missing	missing	NA	else
ascites	ascites_cat3_NA::a	cat	cat	tester1; tester2	[ascites]	ascites	2	NA::a	not applicable	not applicable	NA	9
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	NA::a	not applicable	not applicable	NA	99999

notes: any relevant notes to inform the user running the recode-with-table function. Things to include here would be changes in wording across datasets, missing/changes in categories, and changes in variable type across datasets.

variable	dummyVariable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd	catLabel	catLabelLong	units	recStart	catStartLabel
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	copy	NA	NA	NA	[200,15000]	NA
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	NA::b	missing	missing	NA	else	NA
ascites	ascites_cat3_1	cat	cat	tester1; tester2	[ascites]	ascites	2	0	ascites 0	ascites 0	NA	0	ascites 0
ascites	ascites_cat3_2	cat	cat	tester1; tester2	[ascites]	ascites	2	1	ascites 1	ascites 1	NA	1	ascites 1
ascites	ascites_cat3_NA::b	cat	cat	tester1; tester2	[ascites]	ascites	2	NA::b	missing	missing	NA	else	ascites 0
ascites	ascites_cat3_NA::a	cat	cat	tester1; tester2	[ascites]	ascites	2	NA::a	not applicable	not applicable	NA	9	not applicable
alk.phos	NA	cont	cont	tester1; tester2	[alk.phos]	alk.phos	NA	NA::a	not applicable	not applicable	NA	99999	NA

Derived Variables

The same naming convention applies to derived variables with the exception of two columns:

In variableStart, instead of database names being listed, DerivedVar:: is written followed with the list of original variables used inside square brackets.

DerivedVar::[var1, var2, var3]

In recEnd, write Func:: followed with the exact name of the custom function used to create the derived variable.

Func::derivedFunction

A derived variable looks like this in variable_details.csv

variable	dummyVariable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd	catLabel	catLabelLong	units	recStart	catStartLabel	variableStartShortLabel	notes
example_der	NA	cont	cont	tester1; tester2	DerivedVar::[chol, bili]	example_der	NA	Func::example_der_fun	N/A	N/A	NA	else	N/A	example_der	A test derived variable

Tables

Certain derived variables may need information from a reference table to complete their derivation. For example, imagine a derived variable that has the average sodium consumption for an individual specific to the individual’s age and sex. It makes sense to put this information in a table that maps the individual’s age and sex to their sodium consumption like below,

age	sex	sodium_consumption
18	male	1800
18	male	1500
19	female	2100
19	female	1600

with the derivation function shown below,

sodium_consumption <- function(age, sex, reference_table) {
  return(reference_table[reference_table$age == age & reference_table$sex == sex, ]$sodium_consumption)
}

The reference_table argument in the above function expects a data frame.

Including a table in the start variables for a derived variable can be done using the syntax, tables::<table_name> where should be replaced with the name of the table. For the sodium_consumption derived variable, the entry in the variables sheet would be,

variable	typeEnd	databaseStart	variableStart	typeStart	recEnd	numValidCategories	recStart	catLabel	catLabelLong
sodium_consumption	cont	database_one	DerivedVar::[age, sex, tables::sodium_reference_table]	N/A	Func::sodium_consumption	N/A	N/A

When using the rec_with_table function, the tables argument can be used to include all the tables mentioned in the variables sheet.