Recode with Table — rec_with_table • recodeflow

Creates new variables by recoding variables in a dataset using the rules specified in a variables details sheet

rec_with_table(
  data,
  variables = NULL,
  database_name = NULL,
  variable_details = NULL,
  else_value = NA,
  append_to_data = FALSE,
  log = FALSE,
  notes = TRUE,
  var_labels = NULL,
  custom_function_path = NULL,
  attach_data_name = FALSE,
  id_role_name = NULL,
  name_of_environment_to_load = NULL,
  append_non_db_columns = FALSE,
  tables = list()
)

Arguments

data: A dataframe containing the variables to be recoded. Can also be a named list of dataframes.
variables: Character vector containing the names of the new variables to recode to or a dataframe containing a variables sheet.
database_name: A String containing the name of the database containing the original variables which should match up with a database from the databaseStart column in the variables details sheet. Should be a character vector if data is a named list where each vector item matches a name in the data list and also matches with a value in the databaseStart column of a variable details sheet.
variable_details: A dataframe containing the specifications for recoding.
else_value: Value (string, number, integer, logical or NA) that is used to replace any values that are outside the specified ranges (no rules for recoding).
append_to_data: Logical, if TRUE (default), the newly created variables will be appended to the original dataset.
log: Logical, if FALSE (default), a log containing information about the recoding will not be printed.
notes: Logical, if FALSE (default), will not print the content inside the `Note“ column of the variable being recoded.
var_labels: labels vector to attach to variables in variables
custom_function_path: string containing the path to the file containing functions to run for derived variables. This file will be sourced and its functions loaded into the R environment.
attach_data_name: logical to attach name of database to end table
id_role_name: name for the role to be used to generate id column
name_of_environment_to_load: Name of package to load variables and variable_details from
append_non_db_columns: boolean determening if data not present in this cycle should be appended as NA
tables: named list of data.frame A list of reference tables that can be passed as parameters into the function for a derived variable

Value

a dataframe that is recoded according to rules in variable_details.

Details

The variable_details dataframe needs the following columns:

variable: Name of the new variable created. The name of the new variable can be the same as the original variable if it does not change the original variable definition
toType: type the new variable cat = categorical, cont = continuous
databaseStart: Names of the databases that the original variable can come from. Each database name should be seperated by a comma. For eg., "cchs2001_p, cchs2003_p,cchs2005_p,cchs2007_p"
variableStart: Names of the original variables within each database specified in the databaseStart column. For eg. , "cchs2001_p::RACA_6A,cchs2003_p::RACC_6A,ADL_01". The final variable specified is the name of the variable for all other databases specified in databaseStart but not in this column. For eg., ADL_01 would be the original variable name in the cchs2005_p and cchs2007_p databases.
fromType: variable type of start variable. cat = categorical or factor variable cont = continuous variable (real number or integer)
recTo: Value to recode to
recFrom: Value/range being recoded from

Each row in the variables details sheet encodes the rule for recoding value(s) of the original variable to a category in the new variable. The categories of the new variable are encoded in the recTo column and the value(s) of the original variable that recode to this new value are encoded in the recFrom column. These recode columns follow a syntax similar to the sjmisc::rec() function. Whereas in the sjmisc::rec() function the recoding rules are in one string, in the variables details sheet they are encoded over multiple rows and columns (recFrom an recTo). For eg., a recoding rule in the sjmisc function would like like "1=2;2=3" whereas in the variables details sheet this would be encoded over two rows with recFrom and recTo values of the first row being 1 and 2 and similarly for the second row it would be 2 and 3. The rules for describing recoding pairs are shown below:

recode pairs: Each recode pair is a row
multiple values: Multiple values from the old variable that should be recoded into a new category of the new variable should be separated with a comma. e.g., recFrom = "1,2"; recTo = 1
value range: A value range is indicated by a colon, e.g. recFrom= "1:4"; recTo = 1 will recode all values from 1 to 4 into 1
min and max: minimum and maximum values are indicated by min (or lo) and max (or hi), e.g. recFrom = "min:4"; recTo = 1 will recode all values from the minimum value of the original variable to 4 into 1
"else": All other values, which have not been specified yet, are indicated by else, e.g. recFrom = "else"; recTo = NA will recode all other values (not specified in other rows) of the original variable to "NA")
"copy": the else token can be combined with copy, indicating that all remaining, not yet recoded values should stay the same (are copied from the original value), e.g. recFrom = "else"; recTo = "copy"
NA's: NA values are allowed both for the original and the new variable, e.g. recFrom "NA"; recTo = 1. or "recFrom = "3:5"; recTo = "NA" (recodes all NA into 1, and all values from 3 to 5 into NA in the new variable)

Examples

var_details <-
  data.frame(
    "variable" = c("time", rep("status", times = 3), rep("trt", times = 2), "age", rep("sex", times = 2), rep("ascites", times = 2), rep("hepato", times = 2), rep("spiders", times = 2), rep("edema", times = 3), "bili", "chol", "albumin", "copper", "alk.phos", "ast", "trig", "platelet", "protime", rep("stage", times = 4)),
    "dummyVariable" = c("NA", "status0", "status1","status2", "trt1","trt2","NA","sexM","sexF", "ascites0", "ascites1","hepato0","hepato1","spiders0","spiders1","edema0.0","edema0.5","edema1.0",rep("NA",times = 9), "stage1", "stage2","stage3","stage4"),
    "typeEnd" = c("cont", rep("cat", times = 3), rep("cat", times = 2), "cont", rep("cat", times = 2), rep("cat", times = 2), rep("cat", times = 2),rep("cat", times = 2), rep("cat", times = 3), rep("cont", times = 9), rep("cat", times = 4)),
    "databaseStart" = rep("tester1, tester2", times = 31),
    "variableStart" = c("[time]", rep("[status]", times = 3), rep("[trt]", times = 2), "[age]", rep("[sex]", times = 2), rep("[ascites]", times = 2), rep("[hepato]", times = 2), rep("[spiders]", times = 2), rep("[edema]", times = 3), "[bili]", "[chol]", "[albumin]", "[copper]", "[alk.phos]", "[ast]", "[trig]", "[platelet]", "[protime]", rep("[stage]", times = 4)),
    "typeStart" = c("cont", rep("cat", times = 3), rep("cat", times = 2), "cont", rep("cat", times = 2), rep("cat", times = 2), rep("cat", times = 2),rep("cat", times = 2), rep("cat", times = 3), rep("cont", times = 9), rep("cat", times = 4)),
    "recEnd" = c("copy", "0", "1","2", "1","2","copy","m","f", "0", "1","0","1","0","1","0.0","0.5","1.0",rep("copy",times = 9), "1", "2","3","4"),
    "catLabel" = c("", "status 0", "status 1","status 2", "trt 1","trt 2","","sex m","sex f", "ascites 0", "ascites 1","hepato 0","hepato 1","spiders 0","spiders 1","edema 0.0","edema 0.5","edema 1.0",rep("",times = 9), "stage 1", "stage 2","stage 3","stage 4"),
    "catLabelLong" = c("", "status 0", "status 1","status 2", "trt 1","trt 2","","sex m","sex f", "ascites 0", "ascites 1","hepato 0","hepato 1","spiders 0","spiders 1","edema 0.0","edema 0.5","edema 1.0",rep("",times = 9), "stage 1", "stage 2","stage 3","stage 4"),
    "recStart" = c("else", "0", "1","2", "1","2","else","m","f", "0", "1","0","1","0","1","0.0","0.5","1.0",rep("else",times = 9), "1", "2","3","4"),
    "catStartLabel" = c("", "status 0", "status 1","status 2", "trt 1","trt 2","","sex m","sex f", "ascites 0", "ascites 1","hepato 0","hepato 1","spiders 0","spiders 1","edema 0.0","edema 0.5","edema 1.0",rep("",times = 9), "stage 1", "stage 2","stage 3","stage 4"),
    "variableStartShortLabel" = c("time", rep("status", times = 3), rep("trt", times = 2), "age", rep("sex", times = 2), rep("ascites", times = 2), rep("hepato", times = 2), rep("spiders", times = 2), rep("edema", times = 3), "bili", "chol", "albumin", "copper", "alk.phos", "ast", "trig", "platelet", "protime", rep("stage", times = 4)),
    "variableStartLabel" = c("time", rep("status", times = 3), rep("trt", times = 2), "age", rep("sex", times = 2), rep("ascites", times = 2), rep("hepato", times = 2), rep("spiders", times = 2), rep("edema", times = 3), "bili", "chol", "albumin", "copper", "alk.phos", "ast", "trig", "platelet", "protime", rep("stage", times = 4)),
    "units" = rep("NA", times = 31),
    "notes" = rep("This is sample survival pbc data", times = 31)
  )
var_sheet <-
  data.frame(
    "variable" = c("time","status","trt", "age","sex","ascites","hepato", "spiders", "edema", "bili", "chol", "albumin", "copper", "alk.phos", "ast", "trig", "platelet", "protime", "stage"),
    "label" = c("time","status","trt", "age","sex","ascites","hepato", "spiders", "edema", "bili", "chol", "albumin", "copper", "alk.phos", "ast", "trig", "platelet", "protime", "stage"),
    "labelLong" = c("time","status","trt", "age","sex","ascites","hepato", "spiders", "edema", "bili", "chol", "albumin", "copper", "alk.phos", "ast", "trig", "platelet", "protime", "stage"),
    "section" = rep("tester", times=19),
    "subject" = rep("tester",times = 19),
    "variableType" = c("cont", "cat", "cat", "cont","cat", "cat", "cat","cat", "cat", rep("cont", times = 9), "cat"),
    "databaseStart" = rep("tester1, tester2", times = 19),
    "units" = rep("NA", times = 19),
    "variableStart" = c("[time]","[status]", "[trt]", "[age]", "[sex]", "[ascites]","[hepato]","[spiders]","[edema]", "[bili]", "[chol]", "[albumin]", "[copper]", "[alk.phos]", "[ast]", "[trig]", "[platelet]", "[protime]","[stage]")
  )
library(survival)
tester1 <- survival::pbc[1:209,]
tester2 <- survival::pbc[210:418,]
db_name1 <- "tester1"
db_name2 <- "tester2"

rec_sample1 <- rec_with_table(data = tester1,
variables = var_sheet,
variable_details = var_details,
database_name = db_name1)
#> NOTE for ascites: This is sample survival pbc data
#> NOTE for ascites: This is sample survival pbc data
#> NOTE for edema: This is sample survival pbc data
#> NOTE for edema: This is sample survival pbc data
#> NOTE for edema: This is sample survival pbc data
#> NOTE for hepato: This is sample survival pbc data
#> NOTE for hepato: This is sample survival pbc data
#> NOTE for sex: This is sample survival pbc data
#> NOTE for sex: This is sample survival pbc data
#> NOTE for spiders: This is sample survival pbc data
#> NOTE for spiders: This is sample survival pbc data
#> NOTE for stage: This is sample survival pbc data
#> NOTE for stage: This is sample survival pbc data
#> NOTE for stage: This is sample survival pbc data
#> NOTE for stage: This is sample survival pbc data
#> NOTE for status: This is sample survival pbc data
#> NOTE for status: This is sample survival pbc data
#> NOTE for status: This is sample survival pbc data
#> NOTE for trt: This is sample survival pbc data
#> NOTE for trt: This is sample survival pbc data

rec_sample2 <- rec_with_table(data = tester2,
variables = var_sheet,
variable_details = var_details,
database_name = db_name2)
#> NOTE for ascites: This is sample survival pbc data
#> NOTE for ascites: This is sample survival pbc data
#> NOTE for edema: This is sample survival pbc data
#> NOTE for edema: This is sample survival pbc data
#> NOTE for edema: This is sample survival pbc data
#> NOTE for hepato: This is sample survival pbc data
#> NOTE for hepato: This is sample survival pbc data
#> NOTE for sex: This is sample survival pbc data
#> NOTE for sex: This is sample survival pbc data
#> NOTE for spiders: This is sample survival pbc data
#> NOTE for spiders: This is sample survival pbc data
#> NOTE for stage: This is sample survival pbc data
#> NOTE for stage: This is sample survival pbc data
#> NOTE for stage: This is sample survival pbc data
#> NOTE for stage: This is sample survival pbc data
#> NOTE for status: This is sample survival pbc data
#> NOTE for status: This is sample survival pbc data
#> NOTE for status: This is sample survival pbc data
#> NOTE for trt: This is sample survival pbc data
#> NOTE for trt: This is sample survival pbc data