Reference Expressions

When creating a set of model parameter files for an algorithm, you may find yourself copying values between files. For example, you may want to use the median value of a variable from a descriptive statistics file in a centering step file. Rather than copying the median into the centering step file, you can use a reference expression, which keeps the two values synchronized whenever the descriptive statistics file is updated. A reference expression creates a link between cells of two different model parameter files.

Imagine an algorithm developed to predict the risk of developing diabetes that has three variables: age, sex, and ethnicity. Age is a continuous variable, sex is dichotomous with categories male and female, and ethnicity is dichotomous with categories white and other. The descriptive statistics file, named diabetes-des-stats, with the mean for each variable and its categories, is shown below.

Let’s assume this is a Cox proportional hazards model, so the model parameter files should include all the steps needed to transform the initial set of variables into their final form before they can be scored to calculate the risk of developing diabetes. This algorithm has only two steps: a step to create the dummy variables and a centering step. The variables are centered on their mean, and the centering file can be seen below.

Note that among the categories, only male and ethnicity white were centered; female and ethnicity “other” are the reference categories.
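As a rough illustration of the two steps described above, the following Python sketch creates the dummy variables and then centers each variable on its mean. The means, the variable names Sex_male and Ethnicity_white, and the example person are all hypothetical; they stand in for the values in diabetes-des-stats and are not taken from it.

```python
# Hypothetical means standing in for the values in diabetes-des-stats.
MEANS = {"Age": 50.0, "Sex_male": 0.5, "Ethnicity_white": 0.75}


def create_dummies(person):
    """Step 1: turn the categorical variables into 0/1 dummy variables.

    Female and ethnicity "other" are the reference categories, so they
    get no dummy variable of their own.
    """
    return {
        "Age": float(person["Age"]),
        "Sex_male": 1.0 if person["Sex"] == "Male" else 0.0,
        "Ethnicity_white": 1.0 if person["Ethnicity"] == "White" else 0.0,
    }


def center(values, means):
    """Step 2: subtract each variable's mean from its value."""
    return {name: value - means[name] for name, value in values.items()}


person = {"Age": 62, "Sex": "Female", "Ethnicity": "White"}
centered = center(create_dummies(person), MEANS)
print(centered)  # {'Age': 12.0, 'Sex_male': -0.5, 'Ethnicity_white': 0.25}
```

The centering values here are hard-coded, which is exactly the duplication that reference expressions are meant to remove.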

Without reference expressions, the centering values in the centerValue column need to be copied from the descriptive statistics file. Every time the descriptive statistics file changes, the centering values need to be updated as well. This may work for one file or for an algorithm with a few variables, but it gets tedious for more complicated algorithms with more model parameter files.

Using reference expressions, we can eliminate these issues. The new centering step file is shown below.

The centerValue column no longer has numbers but reference expressions. Each reference expression refers to a particular cell in the descriptive statistics file. If you’re familiar with R, then you’ll notice that it’s the same code you would use to subset a data frame. If you’re unfamiliar with R, you can use this DataCamp tutorial to learn. In R, you would use the name of the data frame in your code; similarly, here you use the name of the referenced file instead. In this example, the file name is diabetes-des-stats. There’s no need to include the file extension (for example, .csv) when referencing a file.
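To make the idea of "a reference to a particular cell" concrete, here is a minimal Python sketch of the lookup that a reference expression stands for: filter the rows of the referenced file, then pick one column. It does not reproduce the actual reference expression syntax. The column names variable and mean, the lookup helper, and every number are assumptions made for illustration; only the file name diabetes-des-stats and the catValue column come from the text.

```python
import csv
import io

# Hypothetical contents of diabetes-des-stats.csv. The "variable" and
# "mean" column names and all numbers are illustrative assumptions.
DES_STATS_CSV = """variable,catValue,mean
Age,,50.0
Sex,Male,0.5
Sex,Female,0.5
Ethnicity,White,0.75
Ethnicity,Other,0.25
"""


def load_table(csv_text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def lookup(table, column, **conditions):
    """Return the values in `column` for rows matching every condition.

    This mirrors subsetting a data frame: filter rows, then pick a column.
    """
    return [
        row[column]
        for row in table
        if all(row[key] == value for key, value in conditions.items())
    ]


# Referenced files are addressed by name, without the .csv extension.
files = {"diabetes-des-stats": load_table(DES_STATS_CSV)}
des_stats = files["diabetes-des-stats"]

age_mean = float(lookup(des_stats, "mean", variable="Age")[0])
male_mean = float(lookup(des_stats, "mean", variable="Sex", catValue="Male")[0])
print(age_mean, male_mean)  # 50.0 0.5
```

Because the value is looked up each time rather than copied, editing the descriptive statistics data changes the centering value automatically.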

Let’s look at another example, but this time in the validation file for this algorithm.

The validation file has two cells where a reference expression was used:

  1. To specify the values that are allowed for the Sex variable. Here we want to list all the category values for this variable, but rather than typing in Male;Female, we can reference them from the catValue column in the descriptive statistics file.
  2. To specify the replacement value if the Age_centered variable is missing. Once again, rather than typing in the actual mean value from the descriptive statistics file, we can reference it.
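The two validation uses can be sketched in Python as follows. This is only an illustration of the behaviour described in the list, not the validation engine itself: the row layout, the variable and mean column names, the helper functions, and the numbers are all hypothetical.

```python
# Hypothetical rows from diabetes-des-stats; the numbers and the
# "variable" and "mean" column names are illustrative assumptions.
DES_STATS = [
    {"variable": "Age", "catValue": "", "mean": 50.0},
    {"variable": "Sex", "catValue": "Male", "mean": 0.5},
    {"variable": "Sex", "catValue": "Female", "mean": 0.5},
]


def allowed_values(table, variable):
    """Collect every category value listed for a variable."""
    return [row["catValue"] for row in table if row["variable"] == variable]


def replacement_value(table, variable):
    """Return the mean recorded for a variable."""
    return next(row["mean"] for row in table if row["variable"] == variable)


def is_valid_sex(value):
    """Allowed rule: the value must be one of the referenced categories."""
    return value in allowed_values(DES_STATS, "Sex")


def fill_missing_age_centered(value):
    """Replacement rule: use the referenced mean when the value is missing."""
    return replacement_value(DES_STATS, "Age") if value is None else value


print(";".join(allowed_values(DES_STATS, "Sex")))  # Male;Female
print(is_valid_sex("Male"), is_valid_sex("Unknown"))  # True False
print(fill_missing_age_centered(None), fill_missing_age_centered(12.0))  # 50.0 12.0
```

If a new category were added to the descriptive statistics data, the allowed rule would pick it up without any change to the validation file.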

Currently, reference expressions can only be used in the following columns:

  • The value column in the validation file for an allowed rule
  • The error_replace column in the validation file
  • The centerValue column in a centering step file

Also note that referenced files need to be included in the list of files in the model export file.
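A quick consistency check for this requirement could look like the Python sketch below. It assumes you already have the names of the referenced files and the list of files from the model export file as plain lists; the format of the model export file is not covered here, and the file names other than diabetes-des-stats are hypothetical.

```python
from pathlib import PurePosixPath


def missing_references(referenced_files, exported_files):
    """Return referenced file names that are absent from the export list."""
    # References omit the file extension, so compare on the bare name.
    exported = {PurePosixPath(name).stem for name in exported_files}
    return sorted(set(referenced_files) - exported)


# Hypothetical export list that forgets the descriptive statistics file.
exported_files = ["dummy-step.csv", "centering-step.csv"]
referenced_files = ["diabetes-des-stats"]

print(missing_references(referenced_files, exported_files))
# ['diabetes-des-stats']

# Adding the referenced file to the export list resolves the problem.
exported_files.append("diabetes-des-stats.csv")
print(missing_references(referenced_files, exported_files))  # []
```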