Use recodeflow with your data

Example data

Our examples use the dataset pbc from the package survival. We’ve split this dataset into two (tester1 and tester2) to mimic real data e.g., the same survey preformed in separate years. We’ve also added columns (agegrp5 and agegrp10) to this dataset for our examples.

We’ll use our example datasets and the variable stage. The variable stage captures the histologic stage of the patient’s disease (primary biliary cirrhosis (PBC)).

How to create the `variable_details` worksheet

The variable_details worksheet does two important steps. First, it maps variables across datasets. Second, it gives instructions on how to recode the variables.

Note: additional information for the variable_details worksheet is in the article variable_details.

Note: additional details on how to add derived variables to the var_detials worksheet is in the article derived_variables.

Rows

For the stage variable, there are the following six rows:

4 rows for the 4 categories for the histological stage of the disease,
1 row for the not applicable responses, and
1 row for else.

Since stage is coded consistently across the two datasets (tester1 and tester2) we only need one row per categorical response.

Columns (14)

variable: the name of the final recoded variable. We’ll use the same name as in the original dataset stage though you could rename the variable.

Write stage in the column variable in the six rows.

variable
stage
stage
stage
stage
stage
stage

typeEnd: stage captures the stage of the disease and is originally a categorical variable. Therefore, it does not make sense to change stage into a continuous variable.

Write ‘cat’ in the six rows.

	variable	typeEnd
58	stage	cat
59	stage	cat
60	stage	cat
61	stage	cat
62	stage	cat
63	stage	cat

typeStart: stage is originally a categorical variable.

Write ‘cat’ in the six rows.

	variable	typeEnd	typeStart
58	stage	cat	cat
59	stage	cat	cat
60	stage	cat	cat
61	stage	cat	cat
62	stage	cat	cat
63	stage	cat	cat

databaseStart: stage was captured identically for both datasets. Therefore, we will recode the variable the same way regardless of which dataset it’s from.

Write the dataset names, separated by a comma, in the six rows

	variable	typeEnd	typeStart	databaseStart
58	stage	cat	cat	tester1, tester2
59	stage	cat	cat	tester1, tester2
60	stage	cat	cat	tester1, tester2
61	stage	cat	cat	tester1, tester2
62	stage	cat	cat	tester1, tester2
63	stage	cat	cat	tester1, tester2

variableStart: The original variable stage is identical across our datasets.

Write the variable name in squared brackets once per row, for all six rows.

	variable	typeEnd	typeStart	databaseStart	variableStart
58	stage	cat	cat	tester1, tester2	[stage]
59	stage	cat	cat	tester1, tester2	[stage]
60	stage	cat	cat	tester1, tester2	[stage]
61	stage	cat	cat	tester1, tester2	[stage]
62	stage	cat	cat	tester1, tester2	[stage]
63	stage	cat	cat	tester1, tester2	[stage]

variableStartLabel:

Write “stage” in the 6 rows.

	variable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel
58	stage	cat	cat	tester1, tester2	[stage]	stage
59	stage	cat	cat	tester1, tester2	[stage]	stage
60	stage	cat	cat	tester1, tester2	[stage]	stage
61	stage	cat	cat	tester1, tester2	[stage]	stage
62	stage	cat	cat	tester1, tester2	[stage]	stage
63	stage	cat	cat	tester1, tester2	[stage]	stage

numValidCat: The number of categories within the variable stage. There are four categories in stage: 1, 2, 3, and 4. Note that for categories: not applicable, missing, and else, are not included in the category count.

Write 4 in each of the six rows.

	variable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat
58	stage	cat	cat	tester1, tester2	[stage]	stage	4
59	stage	cat	cat	tester1, tester2	[stage]	stage	4
60	stage	cat	cat	tester1, tester2	[stage]	stage	4
61	stage	cat	cat	tester1, tester2	[stage]	stage	4
62	stage	cat	cat	tester1, tester2	[stage]	stage	4
63	stage	cat	cat	tester1, tester2	[stage]	stage	4

recEnd: Since stage is a categorical variable, you’ll capture the category you are recoding each row too. For the not applicable rows NA::a is written. For the missing and else rows NA::b is written. The haven package is used for tagging NA in numeric variables.

We are not changing the categories. Therefore, the four stages captured in the original data. For the not applicable rows write NA::a. For the else rows write NA::b.

	variable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd
58	stage	cat	cat	tester1, tester2	[stage]	stage	4	1
59	stage	cat	cat	tester1, tester2	[stage]	stage	4	2
60	stage	cat	cat	tester1, tester2	[stage]	stage	4	3
61	stage	cat	cat	tester1, tester2	[stage]	stage	4	4
62	stage	cat	cat	tester1, tester2	[stage]	stage	4	Na::a
63	stage	cat	cat	tester1, tester2	[stage]	stage	4	NA::b

catLabel: Name the categorical level for the recoded variable.

Write Stage 1, Stage 2, Stage 3, Stage 4, NA, and missing.

	variable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd	catLabel
58	stage	cat	cat	tester1, tester2	[stage]	stage	4	1	stage 1
59	stage	cat	cat	tester1, tester2	[stage]	stage	4	2	stage 2
60	stage	cat	cat	tester1, tester2	[stage]	stage	4	3	stage 3
61	stage	cat	cat	tester1, tester2	[stage]	stage	4	4	stage 4
62	stage	cat	cat	tester1, tester2	[stage]	stage	4	Na::a	NA
63	stage	cat	cat	tester1, tester2	[stage]	stage	4	NA::b	missing

catLabelLong: You can elaborate on the name of the categorical level of the recoded variable.

Copy values from catLabel

	variable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd	catLabel	catLabelLong
58	stage	cat	cat	tester1, tester2	[stage]	stage	4	1	stage 1	stage 1
59	stage	cat	cat	tester1, tester2	[stage]	stage	4	2	stage 2	stage 2
60	stage	cat	cat	tester1, tester2	[stage]	stage	4	3	stage 3	stage 3
61	stage	cat	cat	tester1, tester2	[stage]	stage	4	4	stage 4	stage 4
62	stage	cat	cat	tester1, tester2	[stage]	stage	4	Na::a	NA	NA
63	stage	cat	cat	tester1, tester2	[stage]	stage	4	NA::b	missing	missing

units: The histologic stage of disease does not have units of measurement.

Write “N/A” in all six rows.

	variable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd	catLabel	catLabelLong	units
58	stage	cat	cat	tester1, tester2	[stage]	stage	4	1	stage 1	stage 1	N/A
59	stage	cat	cat	tester1, tester2	[stage]	stage	4	2	stage 2	stage 2	N/A
60	stage	cat	cat	tester1, tester2	[stage]	stage	4	3	stage 3	stage 3	N/A
61	stage	cat	cat	tester1, tester2	[stage]	stage	4	4	stage 4	stage 4	N/A
62	stage	cat	cat	tester1, tester2	[stage]	stage	4	Na::a	NA	NA	N/A
63	stage	cat	cat	tester1, tester2	[stage]	stage	4	NA::b	missing	missing	N/A

recStart: Since we are not combining levels of categories and we are keeping the category levels the same, the recStart column will be identical to recEnd.

Write the category level you are recoding each row too. For the not applicable rows NA::a is written. For the missing and else rows NA::b is written.

	variable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd	catLabel	catLabelLong	units	recStart
58	stage	cat	cat	tester1, tester2	[stage]	stage	4	1	stage 1	stage 1	N/A	1
59	stage	cat	cat	tester1, tester2	[stage]	stage	4	2	stage 2	stage 2	N/A	2
60	stage	cat	cat	tester1, tester2	[stage]	stage	4	3	stage 3	stage 3	N/A	3
61	stage	cat	cat	tester1, tester2	[stage]	stage	4	4	stage 4	stage 4	N/A	4
62	stage	cat	cat	tester1, tester2	[stage]	stage	4	Na::a	NA	NA	N/A	Na
63	stage	cat	cat	tester1, tester2	[stage]	stage	4	NA::b	missing	missing	N/A	else

catStartLabel: The stage label should be identical to what is shown in the original data documentation. For the missing rows, each missing category is described along with their coded values.

Write Stage 1, Stage 2, Stage 3, Stage 4, NA, and missing.

	variable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd	catLabel	catLabelLong	units	recStart	catStartLabel
58	stage	cat	cat	tester1, tester2	[stage]	stage	4	1	stage 1	stage 1	N/A	1	stage 1
59	stage	cat	cat	tester1, tester2	[stage]	stage	4	2	stage 2	stage 2	N/A	2	stage 2
60	stage	cat	cat	tester1, tester2	[stage]	stage	4	3	stage 3	stage 3	N/A	3	stage 3
61	stage	cat	cat	tester1, tester2	[stage]	stage	4	4	stage 4	stage 4	N/A	4	stage 4
62	stage	cat	cat	tester1, tester2	[stage]	stage	4	Na::a	NA	NA	N/A	Na	N/A
63	stage	cat	cat	tester1, tester2	[stage]	stage	4	NA::b	missing	missing	N/A	else	missing

notes: Capture any important differences in a variable across datasets. For our example, there are no differences across datasets.

Write “This is sample survival pbc data” in all six rows.

	variable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd	catLabel	catLabelLong	units	recStart	catStartLabel	notes
58	stage	cat	cat	tester1, tester2	[stage]	stage	4	1	stage 1	stage 1	N/A	1	stage 1	This is sample survival pbc data
59	stage	cat	cat	tester1, tester2	[stage]	stage	4	2	stage 2	stage 2	N/A	2	stage 2	This is sample survival pbc data
60	stage	cat	cat	tester1, tester2	[stage]	stage	4	3	stage 3	stage 3	N/A	3	stage 3	This is sample survival pbc data
61	stage	cat	cat	tester1, tester2	[stage]	stage	4	4	stage 4	stage 4	N/A	4	stage 4	This is sample survival pbc data
62	stage	cat	cat	tester1, tester2	[stage]	stage	4	Na::a	NA	NA	N/A	Na	N/A	This is sample survival pbc data
63	stage	cat	cat	tester1, tester2	[stage]	stage	4	NA::b	missing	missing	N/A	else	missing	This is sample survival pbc data

`variable_details` for dervived variables

The same naming convention applies to derived variables with the exception of two columns:

In variableStart, instead of database names being listed, DerivedVar:: is written followed with the list of original variables used inside square brackets.

DerivedVar::[var1, var2, var3]

In recEnd, write Func:: followed with the exact name of the custom function used to create the derived variable.

Func::derivedFunction

A derived variable looks like this in variable_details.csv

	variable	typeEnd	typeStart	databaseStart	variableStart	variableStartLabel	numValidCat	recEnd	catLabel	catLabelLong	units	recStart	catStartLabel	notes
64	example_der	cont	cont	tester1, tester2	DerivedVar::[chol, bili]	example_der	N/A	Func::example_der_fun	N/A	N/A	mg/dl	else	N/A	This is sample survival pbc data

How to create the variables worksheet `variables_sheet`

Once mapped and specified on variable_details, the stage variable can be specified on the variables worksheet: variables_sheet. Ensure that the names you used in the variable_details worksheet are identical to those listed in variables_sheet.

variable: the name of the recoded variable (variable_details worksheet column: ‘variable’)
label: the shorthand label for the variable (variable_details worksheet column: ‘variableStartShortLabel’)
labelLong: a more detailed label for the variable (new column, it is not in the variable_details worksheet)
section: group of data the variable belongs to (new column, it is not in the variable_details worksheet)
subject: sub-group of data the variable belongs to (new column, it is not in the variable_details worksheet)
variableType: whether the recoded variable is categorical or continuous (variable_detail worksheet column: ‘typeEnd’)
units: any units for the variable (variable_detail worksheet column ‘units’)
databaseStart: the list of databases that contain the variable of interest (variable_detail worksheet column: ‘databaseStart’)
variableStart the original variable name (variable_detail worksheet column: ‘variableStart’)

	variable	label	labelLong	subject	section	variableType	databaseStart	units	variableStart
19	stage	stage	histologic stage of disease	lab test	stage	cat	tester1, tester2	N/A	[stage]

How to use `recodeflow` with your data

Example data

How to create the `variable_details` worksheet

Rows

Columns (14)

`variable_details` for dervived variables

How to create the variables worksheet `variables_sheet`

Contents

Use recodeflow with your data

How to use recodeflow with your data

Example data

How to create the variable_details worksheet

Rows

Columns (14)

variable_details for dervived variables

How to create the variables worksheet variables_sheet

Contents

How to use `recodeflow` with your data

How to create the `variable_details` worksheet

`variable_details` for dervived variables

How to create the variables worksheet `variables_sheet`