To use recodeflow with your data, you’ll need create the following worksheets:

  • variable_details - mapping of variables across datasets and a list of instructions on how to recode variables, and
  • variables - a list of variables to recode

Files can either be a .csv file that you import to R or a dataframe created directly in R.

If you have derived variables, you will also need to create custom functions. For more information on custom functions see the article derived variables.

Example data

Our examples use the dataset pbc from the package survival. We’ve split this dataset in two (tester1 and tester2) to mimic real data e.g., the same survey preformed in separate years. We’ve also added columns (agegrp5 and agegrp10) to this dataset for our examples.

We’ll use our example datasets and the variable stage. The variable stage captures the histologic stage of the patient’s disease (primary biliary cirrhosis (PBC)).

How to create the variable_details worksheet

The variable_details worksheet does two important steps. First, it maps variables across datasets. Second, it gives instructions on how to recode the variables.

Note: additional information for the variable_details worksheet is in the article variable_details.

Note: additional details on how to add derived variables to the variable_detials worksheet is in the article derived variables.

Rows

For the stage variable, there are the following six rows:

  • 4 rows for the 4 categories for the histological stage of the disease,
  • 1 row for the not applicable responses, and
  • 1 row for else.

Since stage is coded consistently across the two datasets (tester1 and tester2) we only need one row per categorical response.

Columns (14)

  1. variable: the name of the final recoded variable. We’ll use the same name as in the original dataset stage though you could rename the variable.

Write stage in the column variable in the six rows.

variable
stage
stage
stage
stage
stage
stage
  1. typeEnd: indicates the type of variable (continous or categorical)for the recoded (final) variable. stage, which captures the stage of the disease, is a categorical variable in the original dataset and will remain a categorical variable after recoding.

Write ‘cat’ in the six rows.

variable typeEnd
58 stage cat
59 stage cat
60 stage cat
61 stage cat
62 stage cat
63 stage cat
  1. typeStart: indicates the type of variable (continous or categorical) for the original variable. stage is a categorical variable in the original dataset.

Write ‘cat’ in the six rows.

variable typeEnd typeStart
58 stage cat cat
59 stage cat cat
60 stage cat cat
61 stage cat cat
62 stage cat cat
63 stage cat cat
  1. databaseStart: indicates the name of the database(s) from which the original variable(s) is(are) obtained.

Write the dataset names, separated by a comma, in the six rows

variable typeEnd typeStart databaseStart
58 stage cat cat tester1, tester2
59 stage cat cat tester1, tester2
60 stage cat cat tester1, tester2
61 stage cat cat tester1, tester2
62 stage cat cat tester1, tester2
63 stage cat cat tester1, tester2
  1. variableStart: indicates the original variable name(s) in the database(s). In our example, both datasets have the start variable stage. Therefore we can indicate a single variable name here. If the variable names were different, we would need to indicate ‘dataset_name::variable_name’ separated by commas for each of the datasets.

Write the variable name in squared brackets once per row, for all six rows.

variable typeEnd typeStart databaseStart variableStart
58 stage cat cat tester1, tester2 [stage]
59 stage cat cat tester1, tester2 [stage]
60 stage cat cat tester1, tester2 [stage]
61 stage cat cat tester1, tester2 [stage]
62 stage cat cat tester1, tester2 [stage]
63 stage cat cat tester1, tester2 [stage]
  1. variableStartLabel: indicates the original variable lable

Write “stage” in the 6 rows.

variable typeEnd typeStart databaseStart variableStart variableStartLabel
58 stage cat cat tester1, tester2 [stage] stage
59 stage cat cat tester1, tester2 [stage] stage
60 stage cat cat tester1, tester2 [stage] stage
61 stage cat cat tester1, tester2 [stage] stage
62 stage cat cat tester1, tester2 [stage] stage
63 stage cat cat tester1, tester2 [stage] stage
  1. numValidCat: indicates the number of valid categories for the final derived variable. In our example, there are four categories for stage: 1, 2, 3, and 4. Note that the categories ‘not applicable’, ‘missing’, and ‘else’ are not included in the category count.

Write 4 in each of the six rows.

variable typeEnd typeStart databaseStart variableStart variableStartLabel numValidCat
58 stage cat cat tester1, tester2 [stage] stage 4
59 stage cat cat tester1, tester2 [stage] stage 4
60 stage cat cat tester1, tester2 [stage] stage 4
61 stage cat cat tester1, tester2 [stage] stage 4
62 stage cat cat tester1, tester2 [stage] stage 4
63 stage cat cat tester1, tester2 [stage] stage 4
  1. recEnd: indicates the category to which you are recoding each row. For the not applicable rows NA::a is written. For the missing and else rows NA::b is written. The haven package is used for tagging NA in numeric variables.

We are not changing the categories stage, therefore, the recEnd values for these rows will be the same as the the original data. For the not applicable rows write NA::a. For the missing and else rows write NA::b.

variable typeEnd typeStart databaseStart variableStart variableStartLabel numValidCat recEnd
58 stage cat cat tester1, tester2 [stage] stage 4 1
59 stage cat cat tester1, tester2 [stage] stage 4 2
60 stage cat cat tester1, tester2 [stage] stage 4 3
61 stage cat cat tester1, tester2 [stage] stage 4 4
62 stage cat cat tester1, tester2 [stage] stage 4 Na::a
63 stage cat cat tester1, tester2 [stage] stage 4 NA::b
  1. catLabel: indicates the lable for the recoded categorical level.

Write Stage 1, Stage 2, Stage 3, Stage 4, NA, and missing.

variable typeEnd typeStart databaseStart variableStart variableStartLabel numValidCat recEnd catLabel
58 stage cat cat tester1, tester2 [stage] stage 4 1 stage 1
59 stage cat cat tester1, tester2 [stage] stage 4 2 stage 2
60 stage cat cat tester1, tester2 [stage] stage 4 3 stage 3
61 stage cat cat tester1, tester2 [stage] stage 4 4 stage 4
62 stage cat cat tester1, tester2 [stage] stage 4 Na::a NA
63 stage cat cat tester1, tester2 [stage] stage 4 NA::b missing
  1. catLabelLong: provides a more elaborate lable for the recoded categorical level. If not required, repeat the shorter catLabel.

Copy values from catLabel

variable typeEnd typeStart databaseStart variableStart variableStartLabel numValidCat recEnd catLabel catLabelLong
58 stage cat cat tester1, tester2 [stage] stage 4 1 stage 1 stage 1
59 stage cat cat tester1, tester2 [stage] stage 4 2 stage 2 stage 2
60 stage cat cat tester1, tester2 [stage] stage 4 3 stage 3 stage 3
61 stage cat cat tester1, tester2 [stage] stage 4 4 stage 4 stage 4
62 stage cat cat tester1, tester2 [stage] stage 4 Na::a NA NA
63 stage cat cat tester1, tester2 [stage] stage 4 NA::b missing missing
  1. units: indicates the unit of measure for the variable. The histologic stage of disease does not have a units of measurement.

Write “N/A” in all six rows.

variable typeEnd typeStart databaseStart variableStart variableStartLabel numValidCat recEnd catLabel catLabelLong units
58 stage cat cat tester1, tester2 [stage] stage 4 1 stage 1 stage 1 N/A
59 stage cat cat tester1, tester2 [stage] stage 4 2 stage 2 stage 2 N/A
60 stage cat cat tester1, tester2 [stage] stage 4 3 stage 3 stage 3 N/A
61 stage cat cat tester1, tester2 [stage] stage 4 4 stage 4 stage 4 N/A
62 stage cat cat tester1, tester2 [stage] stage 4 Na::a NA NA N/A
63 stage cat cat tester1, tester2 [stage] stage 4 NA::b missing missing N/A
  1. recStart: indicates the category(ies) from which you are recoding each row. Since we are not combining levels of categories and we are keeping the category levels the same, the recStart column will be identical to recEnd. If multiple categories were being combined into a single category, the original categories would be indicated in square brackets, separated by commas.

Write the category level you are recoding each row too. For the not applicable rows NA::a is written. For the missing and else rows NA::b is written.

variable typeEnd typeStart databaseStart variableStart variableStartLabel numValidCat recEnd catLabel catLabelLong units recStart
58 stage cat cat tester1, tester2 [stage] stage 4 1 stage 1 stage 1 N/A 1
59 stage cat cat tester1, tester2 [stage] stage 4 2 stage 2 stage 2 N/A 2
60 stage cat cat tester1, tester2 [stage] stage 4 3 stage 3 stage 3 N/A 3
61 stage cat cat tester1, tester2 [stage] stage 4 4 stage 4 stage 4 N/A 4
62 stage cat cat tester1, tester2 [stage] stage 4 Na::a NA NA N/A Na
63 stage cat cat tester1, tester2 [stage] stage 4 NA::b missing missing N/A else
  1. catStartLabel: indicates the original variable category label. The stage label should be identical to what is shown in the original data documentation. For the missing rows, each missing category is described along with their coded values.

Write Stage 1, Stage 2, Stage 3, Stage 4, NA, and missing.

variable typeEnd typeStart databaseStart variableStart variableStartLabel numValidCat recEnd catLabel catLabelLong units recStart catStartLabel
58 stage cat cat tester1, tester2 [stage] stage 4 1 stage 1 stage 1 N/A 1 stage 1
59 stage cat cat tester1, tester2 [stage] stage 4 2 stage 2 stage 2 N/A 2 stage 2
60 stage cat cat tester1, tester2 [stage] stage 4 3 stage 3 stage 3 N/A 3 stage 3
61 stage cat cat tester1, tester2 [stage] stage 4 4 stage 4 stage 4 N/A 4 stage 4
62 stage cat cat tester1, tester2 [stage] stage 4 Na::a NA NA N/A Na N/A
63 stage cat cat tester1, tester2 [stage] stage 4 NA::b missing missing N/A else missing
  1. notes: Capture any important differences in a variable across datasets. For our example, there are no differences across datasets.

Write “This is sample survival pbc data” in all six rows.

variable typeEnd typeStart databaseStart variableStart variableStartLabel numValidCat recEnd catLabel catLabelLong units recStart catStartLabel notes
58 stage cat cat tester1, tester2 [stage] stage 4 1 stage 1 stage 1 N/A 1 stage 1 This is sample survival pbc data
59 stage cat cat tester1, tester2 [stage] stage 4 2 stage 2 stage 2 N/A 2 stage 2 This is sample survival pbc data
60 stage cat cat tester1, tester2 [stage] stage 4 3 stage 3 stage 3 N/A 3 stage 3 This is sample survival pbc data
61 stage cat cat tester1, tester2 [stage] stage 4 4 stage 4 stage 4 N/A 4 stage 4 This is sample survival pbc data
62 stage cat cat tester1, tester2 [stage] stage 4 Na::a NA NA N/A Na N/A This is sample survival pbc data
63 stage cat cat tester1, tester2 [stage] stage 4 NA::b missing missing N/A else missing This is sample survival pbc data

variable_details for dervived variables

The same naming convention applies to derived variables with the exception of two columns:

  1. In variableStart, instead of database names being listed, DerivedVar:: is written followed with the list of original variables used inside square brackets.
  • DerivedVar::[var1, var2, var3]
  1. In recEnd, write Func:: followed with the exact name of the custom function used to create the derived variable.
  • Func::derivedFunction

A derived variable looks like this in variable_details.csv

variable typeEnd typeStart databaseStart variableStart variableStartLabel numValidCat recEnd catLabel catLabelLong units recStart catStartLabel notes
64 example_der cont cont tester1, tester2 DerivedVar::[chol, bili] example_der N/A Func::example_der_fun N/A N/A mg/dl else N/A This is sample survival pbc data

How to create the variables worksheet variables

Once mapped and specified on variable_details, the stage variable can be specified on the variables worksheet: variables. Ensure that the names you used in the variable_details worksheet are identical to those listed in variables.

  1. variable: the name of the recoded variable (variable_details worksheet column: ‘variable’)

  2. label: the shorthand label for the variable (variable_details worksheet column: ‘variableStartShortLabel’)

  3. labelLong: a more detailed label for the variable (new column, it is not in the variable_details worksheet)

  4. section: group of data to which the variable belongs (new column, it is not in the variable_details worksheet)

  5. subject: sub-group of data to which the variable belongs (new column, it is not in the variable_details worksheet)

  6. variableType: whether the recoded variable is categorical or continuous (variable_detail worksheet column: ‘typeEnd’)

  7. units: unit of measurement for the variable (variable_detail worksheet column ‘units’)

  8. databaseStart: the list of databases that contain the variable of interest (variable_detail worksheet column: ‘databaseStart’)

  9. variableStart the original variable name (variable_detail worksheet column: ‘variableStart’)

variable label labelLong subject section variableType databaseStart units variableStart
19 stage stage histologic stage of disease lab test stage cat tester1, tester2 N/A [stage]