recodeflow
with your dataTo use recodeflow
with your data, you’ll need create the following worksheets:
variable_details
- mapping of variables across datasets and a list of instructions on how to recode variablesvariables
- a list of variables to recode, andFiles can either be a .csv
file that you import to R or create directly in R as a dataframe.
If you have derived variables, you will also need to create custom functions. For more information on custom functions see the article derived_variables.
Our examples use the dataset pbc
from the package survival. We’ve split this dataset into two (tester1 and tester2) to mimic real data e.g., the same survey preformed in separate years. We’ve also added columns (agegrp5
and agegrp10
) to this dataset for our examples.
We’ll use our example datasets and the variable stage
. The variable stage
captures the histologic stage of the patient’s disease (primary biliary cirrhosis (PBC)).
variable_details
worksheetThe variable_details
worksheet does two important steps. First, it maps variables across datasets. Second, it gives instructions on how to recode the variables.
Note: additional information for the variable_details
worksheet is in the article variable_details.
Note: additional details on how to add derived variables to the var_detials
worksheet is in the article derived_variables.
For the stage
variable, there are the following six rows:
Since stage
is coded consistently across the two datasets (tester1 and tester2) we only need one row per categorical response.
stage
though you could rename the variable.Write stage
in the column variable in the six rows.
variable |
---|
stage |
stage |
stage |
stage |
stage |
stage |
stage
captures the stage of the disease and is originally a categorical variable. Therefore, it does not make sense to change stage
into a continuous variable.Write ‘cat’ in the six rows.
variable | typeEnd | |
---|---|---|
58 | stage | cat |
59 | stage | cat |
60 | stage | cat |
61 | stage | cat |
62 | stage | cat |
63 | stage | cat |
stage
is originally a categorical variable.Write ‘cat’ in the six rows.
variable | typeEnd | typeStart | |
---|---|---|---|
58 | stage | cat | cat |
59 | stage | cat | cat |
60 | stage | cat | cat |
61 | stage | cat | cat |
62 | stage | cat | cat |
63 | stage | cat | cat |
stage
was captured identically for both datasets. Therefore, we will recode the variable the same way regardless of which dataset it’s from.Write the dataset names, separated by a comma, in the six rows
variable | typeEnd | typeStart | databaseStart | |
---|---|---|---|---|
58 | stage | cat | cat | tester1, tester2 |
59 | stage | cat | cat | tester1, tester2 |
60 | stage | cat | cat | tester1, tester2 |
61 | stage | cat | cat | tester1, tester2 |
62 | stage | cat | cat | tester1, tester2 |
63 | stage | cat | cat | tester1, tester2 |
stage
is identical across our datasets.Write the variable name in squared brackets once per row, for all six rows.
variable | typeEnd | typeStart | databaseStart | variableStart | |
---|---|---|---|---|---|
58 | stage | cat | cat | tester1, tester2 | [stage] |
59 | stage | cat | cat | tester1, tester2 | [stage] |
60 | stage | cat | cat | tester1, tester2 | [stage] |
61 | stage | cat | cat | tester1, tester2 | [stage] |
62 | stage | cat | cat | tester1, tester2 | [stage] |
63 | stage | cat | cat | tester1, tester2 | [stage] |
Write “stage” in the 6 rows.
variable | typeEnd | typeStart | databaseStart | variableStart | variableStartLabel | |
---|---|---|---|---|---|---|
58 | stage | cat | cat | tester1, tester2 | [stage] | stage |
59 | stage | cat | cat | tester1, tester2 | [stage] | stage |
60 | stage | cat | cat | tester1, tester2 | [stage] | stage |
61 | stage | cat | cat | tester1, tester2 | [stage] | stage |
62 | stage | cat | cat | tester1, tester2 | [stage] | stage |
63 | stage | cat | cat | tester1, tester2 | [stage] | stage |
stage
. There are four categories in stage
: 1, 2, 3, and 4. Note that for categories: not applicable, missing, and else, are not included in the category count.Write 4 in each of the six rows.
variable | typeEnd | typeStart | databaseStart | variableStart | variableStartLabel | numValidCat | |
---|---|---|---|---|---|---|---|
58 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 |
59 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 |
60 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 |
61 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 |
62 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 |
63 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 |
stage
is a categorical variable, you’ll capture the category you are recoding each row too. For the not applicable rows NA::a
is written. For the missing and else rows NA::b
is written. The haven
package is used for tagging NA in numeric variables.We are not changing the categories. Therefore, the four stages captured in the original data. For the not applicable rows write NA::a
. For the else rows write NA::b
.
variable | typeEnd | typeStart | databaseStart | variableStart | variableStartLabel | numValidCat | recEnd | |
---|---|---|---|---|---|---|---|---|
58 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 1 |
59 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 2 |
60 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 3 |
61 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 4 |
62 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | Na::a |
63 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | NA::b |
Write Stage 1, Stage 2, Stage 3, Stage 4, NA, and missing.
variable | typeEnd | typeStart | databaseStart | variableStart | variableStartLabel | numValidCat | recEnd | catLabel | |
---|---|---|---|---|---|---|---|---|---|
58 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 1 | stage 1 |
59 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 2 | stage 2 |
60 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 3 | stage 3 |
61 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 4 | stage 4 |
62 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | Na::a | NA |
63 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | NA::b | missing |
Copy values from catLabel
variable | typeEnd | typeStart | databaseStart | variableStart | variableStartLabel | numValidCat | recEnd | catLabel | catLabelLong | |
---|---|---|---|---|---|---|---|---|---|---|
58 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 1 | stage 1 | stage 1 |
59 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 2 | stage 2 | stage 2 |
60 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 3 | stage 3 | stage 3 |
61 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 4 | stage 4 | stage 4 |
62 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | Na::a | NA | NA |
63 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | NA::b | missing | missing |
Write “N/A” in all six rows.
variable | typeEnd | typeStart | databaseStart | variableStart | variableStartLabel | numValidCat | recEnd | catLabel | catLabelLong | units | |
---|---|---|---|---|---|---|---|---|---|---|---|
58 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 1 | stage 1 | stage 1 | N/A |
59 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 2 | stage 2 | stage 2 | N/A |
60 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 3 | stage 3 | stage 3 | N/A |
61 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 4 | stage 4 | stage 4 | N/A |
62 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | Na::a | NA | NA | N/A |
63 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | NA::b | missing | missing | N/A |
Write the category level you are recoding each row too. For the not applicable rows NA::a
is written. For the missing and else rows NA::b
is written.
variable | typeEnd | typeStart | databaseStart | variableStart | variableStartLabel | numValidCat | recEnd | catLabel | catLabelLong | units | recStart | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
58 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 1 | stage 1 | stage 1 | N/A | 1 |
59 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 2 | stage 2 | stage 2 | N/A | 2 |
60 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 3 | stage 3 | stage 3 | N/A | 3 |
61 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 4 | stage 4 | stage 4 | N/A | 4 |
62 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | Na::a | NA | NA | N/A | Na |
63 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | NA::b | missing | missing | N/A | else |
stage
label should be identical to what is shown in the original data documentation. For the missing rows, each missing category is described along with their coded values.Write Stage 1, Stage 2, Stage 3, Stage 4, NA, and missing.
variable | typeEnd | typeStart | databaseStart | variableStart | variableStartLabel | numValidCat | recEnd | catLabel | catLabelLong | units | recStart | catStartLabel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
58 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 1 | stage 1 | stage 1 | N/A | 1 | stage 1 |
59 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 2 | stage 2 | stage 2 | N/A | 2 | stage 2 |
60 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 3 | stage 3 | stage 3 | N/A | 3 | stage 3 |
61 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 4 | stage 4 | stage 4 | N/A | 4 | stage 4 |
62 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | Na::a | NA | NA | N/A | Na | N/A |
63 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | NA::b | missing | missing | N/A | else | missing |
Write “This is sample survival pbc data” in all six rows.
variable | typeEnd | typeStart | databaseStart | variableStart | variableStartLabel | numValidCat | recEnd | catLabel | catLabelLong | units | recStart | catStartLabel | notes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
58 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 1 | stage 1 | stage 1 | N/A | 1 | stage 1 | This is sample survival pbc data |
59 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 2 | stage 2 | stage 2 | N/A | 2 | stage 2 | This is sample survival pbc data |
60 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 3 | stage 3 | stage 3 | N/A | 3 | stage 3 | This is sample survival pbc data |
61 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | 4 | stage 4 | stage 4 | N/A | 4 | stage 4 | This is sample survival pbc data |
62 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | Na::a | NA | NA | N/A | Na | N/A | This is sample survival pbc data |
63 | stage | cat | cat | tester1, tester2 | [stage] | stage | 4 | NA::b | missing | missing | N/A | else | missing | This is sample survival pbc data |
variable_details
for dervived variablesThe same naming convention applies to derived variables with the exception of two columns:
DerivedVar::[var1, var2, var3]
Func::derivedFunction
A derived variable looks like this in variable_details.csv
variable | typeEnd | typeStart | databaseStart | variableStart | variableStartLabel | numValidCat | recEnd | catLabel | catLabelLong | units | recStart | catStartLabel | notes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
64 | example_der | cont | cont | tester1, tester2 | DerivedVar::[chol, bili] | example_der | N/A | Func::example_der_fun | N/A | N/A | mg/dl | else | N/A | This is sample survival pbc data |
variables_sheet
Once mapped and specified on variable_details
, the stage
variable can be specified on the variables worksheet: variables_sheet
. Ensure that the names you used in the variable_details
worksheet are identical to those listed in variables_sheet
.
variable: the name of the recoded variable (variable_details worksheet column: ‘variable’)
label: the shorthand label for the variable (variable_details worksheet column: ‘variableStartShortLabel’)
labelLong: a more detailed label for the variable (new column, it is not in the variable_details worksheet)
section: group of data the variable belongs to (new column, it is not in the variable_details worksheet)
subject: sub-group of data the variable belongs to (new column, it is not in the variable_details worksheet)
variableType: whether the recoded variable is categorical or continuous (variable_detail worksheet column: ‘typeEnd’)
units: any units for the variable (variable_detail worksheet column ‘units’)
databaseStart: the list of databases that contain the variable of interest (variable_detail worksheet column: ‘databaseStart’)
variableStart the original variable name (variable_detail worksheet column: ‘variableStart’)
variable | label | labelLong | subject | section | variableType | databaseStart | units | variableStart | |
---|---|---|---|---|---|---|---|---|---|
19 | stage | stage | histologic stage of disease | lab test | stage | cat | tester1, tester2 | N/A | [stage] |