Model Parameter Files
Model parameter files are created to identify what needs to be done to the data, how the algorithm works, and where to find the files. This document is the single source of truth for all model parameter CSV files. It will go over the different features currently supported by the files, how to represent each feature in the files and a summary of all the columns in the files.
Model Exports File
One of the four mandatory files that the user has to provide. This file contains a list and location of all the files required by the algorithm (e.g., files to transform data) and after the algorithm (e.g., validation, categorization of outcomes (if applicable)). The metadata for this file is given below:
An example is provided below
Going into more detail,
- The first row gives the location of the variables file. It’s located in the same folder as the model export file and is called variables.csv.
- The second row gives the location of the variable details file. It’s located two folders above the model export file and is called variable-details.csv.
- The third row gives the location of the model steps file. It’s located in the same folder as the model export file and is called model-steps.csv.
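As a rough illustration of the three rows described above, the sketch below builds an exports table in R and resolves each path against the folder that holds the exports file. The column names fileType and filePath, the fileType labels, and the folder /models/my-model are assumptions made for this example; the metadata for the file defines the actual names.

```r
# Hypothetical exports table: column names and fileType labels are assumed.
eg_model_exports <- data.frame(
  fileType = c("variables", "variable-details", "model-steps"),
  filePath = c("./variables.csv", "../../variable-details.csv", "./model-steps.csv"),
  stringsAsFactors = FALSE
)

# Assumed location of the model exports file (illustration only).
exports_dir <- "/models/my-model"

# Every path is relative to the folder that holds the exports file.
eg_model_exports$resolved <- file.path(exports_dir, eg_model_exports$filePath)
print(eg_model_exports)
```

The "../../" prefix on the second row is what places variable-details.csv two folders above the exports file.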
Model Steps File
One of the four mandatory files. This file specifies the steps for calculating the outcome of the model as well as their order. It does not specify the data for each step; that is defined in other files. The metadata for this file is:
An example is given below
The above example model steps file has two steps, dummy and center, defined in the order in which they should be done.
The file which describes the dummy step is called dummy-data.csv and is in the same folder as the model steps file.
The file which describes the center step is called center-data.csv and is in the folder above the model steps file.
Notice how the fileType column for both of these steps is N/A, or Not Applicable. Since the dummy and center steps each require only one file to describe them, we do not need to provide a value in the fileType column for those rows. The Fine and Gray model step described in this document requires multiple files to describe it, hence the need for the fileType column.
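The two-step example described above could be written in R roughly as follows. This is only a sketch: the step column name is an assumption, while fileType and filePath are the column names used elsewhere in this document.

```r
# Hypothetical model steps table; the "step" column name is assumed.
eg_model_steps <- data.frame(
  step = c("dummy", "center"),
  fileType = c("N/A", "N/A"),
  filePath = c("./dummy-data.csv", "../center-data.csv"),
  stringsAsFactors = FALSE
)

# Steps are applied in row order, so the row index is the execution order.
for (i in seq_len(nrow(eg_model_steps))) {
  cat(i, eg_model_steps$step[i], "->", eg_model_steps$filePath[i], "\n")
}
```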
Variable Details File
One of the mandatory files. Its structure is defined in the recodeflow library.
Variables File
One of the mandatory files that defines the starting variables in the model. Its columns have been defined in the recodeflow library.
Tables File
This file lists all the tables referenced within the variables and variable details sheets. Each row in this file gives the name of the table and the path of the table relative to this file. For example,
The above example tables file has entries for two tables named table-one and table-two. table-one is located in the same folder as the tables file and is called table-one.csv while table-two is located one folder above the tables file and is called table-two.csv.
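A tables file matching that description might look like the sketch below. The column names name and path are assumptions chosen for illustration.

```r
# Hypothetical tables file; column names are assumed.
eg_tables <- data.frame(
  name = c("table-one", "table-two"),
  path = c("./table-one.csv", "../table-two.csv"),
  stringsAsFactors = FALSE
)

# Named vector for looking up a table's relative path by its name.
table_paths <- setNames(eg_tables$path, eg_tables$name)
print(table_paths[["table-two"]])
```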
Dummy Step File
Dummying is a transformation where a categorical variable with more than two categories is converted to a set of dichotomous variables. Information about how this transformation is done can be seen here.
The metadata for this file is provided below:
An example is given below
The example above describes the data for dummying two variables:
- smoker_type
- drinker_type
The smoker_type variable has three dummy variables:
- non_smoker: Represents category 0 in the smoker_type variable
- current_smoker: Represents category 1 in the smoker_type variable
- former_smoker: Represents category 2 in the smoker_type variable
The drinker_type variable has four dummy variables:
- non_drinker: Represents category 0 in the drinker_type variable
- current_light_drinker: Represents category 1 in the drinker_type variable
- current_heavy_drinker: Represents category 2 in the drinker_type variable
- former_drinker: Represents category 3 in the drinker_type variable
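To make the transformation concrete, here is a minimal sketch of dummying smoker_type in base R. The toy data values are invented; the mapping from dummy variable to category follows the description above. It is not necessarily how the library performs the step.

```r
# Toy data: smoker_type coded 0 (non), 1 (current), 2 (former).
eg_data <- data.frame(smoker_type = c(0, 1, 2, 1))

# Dummy variable name -> category it represents.
smoker_dummies <- c(non_smoker = 0, current_smoker = 1, former_smoker = 2)

# Each dummy is 1 when the original variable equals its category, else 0.
for (dummy_name in names(smoker_dummies)) {
  eg_data[[dummy_name]] <-
    as.integer(eg_data$smoker_type == smoker_dummies[[dummy_name]])
}
print(eg_data)
```

Each row ends up with exactly one dummy set to 1, which is a quick sanity check for a dummy step.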
Centering Step File
Centering is a transformation where a value, usually the variable’s mean or median, is subtracted from the original variable, redefining the 0 point of the predictor to be the value that was subtracted. The metadata for this file is given below:
An example is given below,
The example above describes the data for centering two variables, age and diet_score.
The values used to center the two variables are 40 and 4 respectively.
The names of the new centered variables are age_C and diet_score_C, and they are both continuous.
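The centering described above amounts to a single subtraction per variable. The sketch below applies it to invented toy data, using the centering values 40 and 4 and the _C naming from the example.

```r
# Toy data (values invented for illustration).
eg_data <- data.frame(age = c(35, 40, 52), diet_score = c(2, 4, 7))

# Centering value for each variable, as given in the example.
center_values <- c(age = 40, diet_score = 4)

# New variable = original variable minus its centering value.
for (var_name in names(center_values)) {
  eg_data[[paste0(var_name, "_C")]] <-
    eg_data[[var_name]] - center_values[[var_name]]
}
print(eg_data)
```

A person aged 40 gets age_C equal to 0, which is the redefined 0 point.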
Restricted Cubic Spline (RCS) Step File
Restricted cubic splines are often used to interpolate a set of data points and guarantee smoothness at the data points. The metadata for this file is given below:
An example is given below,
The example describes the data for creating new spline variables for two variables, Age_c and PackYears_c.
For Age_c, a 5-knot spline is used and 4 new spline variables are created. The knots given are,
- -11.5
- -6.5
- -1.5
- 5.5
- 16.5
The new variables created are,
- AgeC_rcs1
- AgeC_rcs2
- AgeC_rcs3
- AgeC_rcs4
Notice how in the CSV, each knot is separated by a “;” and each new variable name is also separated by a “;”.
Similarly, for PackYears_c, we define a 3 knot spline where the knots are -21.2, -10.35 and 34.8. The new spline variables are PackYearsC_rcs1 and PackYearsC_rcs2.
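Because knots and new variable names are packed into single ";"-separated cells, a reader of the file has to split them before use. The sketch below shows one way to do that in base R; the column names variable, rcsVariables and knots are assumptions made for this example.

```r
# Hypothetical RCS step rows; column names are assumed for illustration.
eg_rcs <- data.frame(
  variable = c("Age_c", "PackYears_c"),
  rcsVariables = c("AgeC_rcs1;AgeC_rcs2;AgeC_rcs3;AgeC_rcs4",
                   "PackYearsC_rcs1;PackYearsC_rcs2"),
  knots = c("-11.5;-6.5;-1.5;5.5;16.5", "-21.2;-10.35;34.8"),
  stringsAsFactors = FALSE
)

# Split the ";"-separated cells into one vector per row.
knots <- lapply(strsplit(eg_rcs$knots, ";", fixed = TRUE), as.numeric)
rcs_names <- strsplit(eg_rcs$rcsVariables, ";", fixed = TRUE)

# In both examples a k-knot spline yields k - 1 new variables.
stopifnot(lengths(rcs_names) == lengths(knots) - 1)
```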
Interaction Step File
Theoretically, an interaction variable is created to model effect modification. Effect modification occurs when the magnitude of the effect of a risk factor on an outcome differs depending on another risk factor. An example is age, cancer and their effect on mortality, where as you get older the effect that cancer has on your risk of mortality increases. Here, age is said to modify the effect that cancer has on mortality. Mathematically, an effect modification is modeled by creating a new variable whose value is the product of the two “interacting” variables, e.g., age and cancer.
The metadata for this file is given below:
An example is given below,
The example above creates two new interaction variables, AgeXCancer and AgeXHypertension. AgeXCancer is created using the Age and Cancer variables, while AgeXHypertension is created using the Age and Hypertension variables. Both new interaction variables are continuous. Notice how the interacting variables are separated by a semicolon in the example.
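Since an interaction variable is the product of its interacting variables, the step can be sketched in a few lines of base R. The column names interactionVariable and interactingVariables and the toy data are assumptions for illustration only.

```r
# Toy data (values invented for illustration).
eg_data <- data.frame(Age = c(50, 70), Cancer = c(1, 0), Hypertension = c(0, 1))

# Hypothetical interaction step rows; column names are assumed.
eg_interactions <- data.frame(
  interactionVariable = c("AgeXCancer", "AgeXHypertension"),
  interactingVariables = c("Age;Cancer", "Age;Hypertension"),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(eg_interactions))) {
  # Split "Age;Cancer" into c("Age", "Cancer").
  parts <- strsplit(eg_interactions$interactingVariables[i], ";", fixed = TRUE)[[1]]
  # Row-wise product of the interacting columns.
  eg_data[[eg_interactions$interactionVariable[i]]] <- Reduce("*", eg_data[parts])
}
print(eg_data)
```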
Simple Model
A simple model is one whose outcome is calculated using a series of transformations defined in the variables and variable details files. Statistically speaking, a simple model is derived using a fitting process, for example the least squares method. For example, a model to calculate the daily sodium consumption for an individual would be calculated by summing the average sodium values for each item they eat in a day.
The export file for a simple model only needs to specify the name of the outcome variable, which should correspond to a variable in the variables.csv sheet. An example is shown below,
eg_simple_model <- data.frame(
  "name" = c("outputVariableName"),
  "value" = c("sodium_per_day")
)

The above file defines a simple model to calculate the average daily sodium consumption for an individual. The model details, i.e., how to calculate the sodium_per_day variable, should be defined in the variables.csv and variable-details.csv files.
Fine and Gray Model Step Files
Similar to a Cox regression model, a Fine and Gray model estimates the risk of an event occurring at some time in the future, the difference being that it takes competing risks into account. Specifying this step requires two files:
- A beta coefficients file that specifies the names of the covariates in the model and their beta coefficient
- A baseline hazards file that specifies a time value and the baseline hazard value associated with it
The metadata for the beta coefficients file is given below:
The metadata for the baseline hazards file is given below:
The baseline hazards file has the following columns:
- time: The time up to which this baseline hazard should be used
- baselineHazard: The baseline hazard value for this time period
When adding a Fine and Gray model step, make sure to add a variable called “time” to the variable details sheet. This variable represents the time interval over which the model is valid. It should be a continuous variable whose recTo value represents the time interval. Its units column should specify the time metric being used, for example years. For example, a model that can predict risk at any point in time between 1 year and 5 years from today would have the following entry in the variable details sheet,
The values of the other columns in the variable details sheet for this variable should be N/A.
An example model steps file and the referenced fine and gray model files are shown below.
The model steps file has 2 rows, both of which belong to a Fine and Gray step. The type of file described in each row is given by the fileType column. The first row references a beta coefficients file and the second row references a baseline hazards file. The path to each file is given in the filePath column and, once again, should be relative to the model steps file that describes them.
The beta coefficients file says that the model has two covariates, Age and Sex, and specifies the beta coefficient for each one, which are 0.01 and 2 respectively.
The baseline hazards file has 5 rows and describes the baseline hazard for each year in the model. Notice how the lowest time value is 1 and the highest value is 5. This is because the time variable we specified earlier goes from 1 to 5.
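To show how the two files could combine, the sketch below computes a predicted risk using the standard Fine and Gray form, risk(t) = 1 - exp(-H0(t) * exp(sum(beta * x))). It assumes that the baselineHazard column holds the cumulative baseline subdistribution hazard up to each time, which this document does not state. The column names of the beta coefficients table, the covariate values and every baseline hazard value are invented for illustration; only the coefficients 0.01 and 2, the time and baselineHazard column names and the 1 to 5 time range come from the example above.

```r
# Beta coefficients from the example; column names are assumed.
eg_betas <- data.frame(
  variable = c("Age", "Sex"),
  coefficient = c(0.01, 2),
  stringsAsFactors = FALSE
)

# Baseline hazard values are invented for illustration.
eg_baseline <- data.frame(
  time = 1:5,
  baselineHazard = c(0.001, 0.002, 0.003, 0.004, 0.005)
)

# Hypothetical helper: risk at `time` for one individual's covariates.
predict_risk <- function(covariates, time, betas, baseline) {
  stopifnot(time %in% baseline$time)
  # Linear predictor: sum of beta coefficient times covariate value.
  linear_predictor <- sum(betas$coefficient * unlist(covariates[betas$variable]))
  h0 <- baseline$baselineHazard[baseline$time == time]
  1 - exp(-h0 * exp(linear_predictor))
}

risk <- predict_risk(list(Age = 10, Sex = 1), time = 3, eg_betas, eg_baseline)
print(risk)
```

The helper only accepts times that appear in the baseline hazards table; a fuller implementation would need a rule for times that fall between rows.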