Overview
This vignette will walk you through:
- Running the transformation pipeline on your own data
- Running the transformation pipeline on the Hypertension Population Risk Tool (HTNPoRT)
Basic Usage
Simple Example
Let’s walk through a simple example. The package uses a two-step workflow:
- prepare_model_pipeline() - Load and validate model configuration files
- run_model_pipeline() - Apply transformations to data
# Step 1: Prepare the model pipeline
mod <- prepare_model_pipeline("path/to/model-export.csv")
# Step 2: Run the pipeline on your data (returns the output)
result <- run_model_pipeline(mod, x = "path/to/input-data.csv")
# View the first few rows
head(result)
Using Data Frames for Input Data
You can pass a data frame instead of a file path for the input data:
# Prepare the model
mod <- prepare_model_pipeline("path/to/model-export.csv")
# Load and preprocess your data
input_data <- read.csv("path/to/input-data.csv")
# Run pipeline with data frame (returns the output)
result <- run_model_pipeline(mod, x = input_data)
This is useful when your data is already loaded or needs preprocessing.
Processing Multiple Datasets
If you need to apply the same model to multiple datasets (e.g., processing batches), reuse the prepared model object for better performance:
# Prepare the model once - configuration files are loaded and cached
mod <- prepare_model_pipeline("path/to/model-export.csv")
# Run on multiple datasets and extract output from each
result1 <- run_model_pipeline(mod, x = "batch1_data.csv")
result2 <- run_model_pipeline(mod, x = "batch2_data.csv")
result3 <- run_model_pipeline(mod, x = "batch3_data.csv")
This avoids re-reading and parsing the configuration files for each batch.
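When the batch files follow a naming pattern, the same pattern can be written as a loop with lapply(). This is a sketch, not part of the package API; the file names in batch_files are illustrative:

```r
# Hypothetical batch file names - substitute your own paths
batch_files <- c("batch1_data.csv", "batch2_data.csv", "batch3_data.csv")

# Reuse the single prepared model object across every batch
results <- lapply(batch_files, function(f) run_model_pipeline(mod, x = f))

# Optionally stack the per-batch outputs into one data frame
combined <- do.call(rbind, results)
```

Because mod is prepared once outside the loop, the configuration files are parsed a single time no matter how many batches you process.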
Restricting File Access with sandbox_path
When running on a server or any public-facing system, the model
configuration files can reference arbitrary paths on the filesystem. Use
the sandbox_path parameter to restrict which files the
pipeline is allowed to read.
mod <- prepare_model_pipeline(
"path/to/model-files/model-export.csv",
sandbox_path = "path/to/model-files/"
)
When sandbox_path is set, every file referenced inside
the model configuration (the model export, variables file, model-steps
file, and any step parameter files) must be located within that
directory. If any path resolves outside of it,
prepare_model_pipeline() stops with an error.
This restriction applies only to the model configuration
files; it does not affect data files passed to
run_model_pipeline(). It does, however, cover the model
export file passed to prepare_model_pipeline().
When to use it:
- You expose the pipeline as a web service or API and the model export path (or paths inside it) could be influenced by user input.
- You want to enforce that a model package stays self-contained within a specific directory and never reads files from elsewhere on the filesystem.
When you can omit it:
- You are running the pipeline locally in a trusted environment where all model files are under your control and path traversal is not a concern. The default (sandbox_path = NULL) imposes no restriction.
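Since prepare_model_pipeline() stops with an error when a referenced file resolves outside the sandbox, a server can catch that condition instead of crashing. This is a hedged sketch (the paths are illustrative, and the exact error message comes from the package):

```r
mod <- tryCatch(
  prepare_model_pipeline(
    "path/to/model-files/model-export.csv",
    sandbox_path = "path/to/model-files/"
  ),
  error = function(e) {
    # Triggered when a configuration file escapes sandbox_path
    # (or when loading fails for any other reason)
    message("Model preparation failed: ", conditionMessage(e))
    NULL
  }
)
```

Checking whether mod is NULL before calling run_model_pipeline() lets the caller return a clean failure response rather than an unhandled error.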
Working with Results
run_model_pipeline() applies the pipeline’s
transformations and returns the results. The mode argument
of run_model_pipeline() (default "output")
controls what columns are returned:
- "output": only the columns produced by the final transformation step
- "full": all columns, i.e. the original predictors plus every intermediate and output column
# Default mode: only the final step's output columns
output <- run_model_pipeline(mod, x = "path/to/input-data.csv")
# Full mode: all columns including intermediate transformation variables
output_full <- run_model_pipeline(
mod,
x = "path/to/input-data.csv",
mode = "full"
)
# View column names to see what transformations were created
colnames(output_full)
Real-World Example: HTNPoRT Model
The HTNPoRT (Hypertension Population Risk Tool) is a validated predictive model for hypertension risk. Here’s how to use this package with HTNPoRT:
# Clone the HTNPoRT repository to get model parameters and validation data
# In your terminal:
# git clone https://github.com/Big-Life-Lab/htnport.git
library(model.parameters.pipeline)
# Set path to cloned HTNPoRT repository
htnport_dir <- "/path/to/htnport"
# Load validation data
data_file <- file.path(
htnport_dir,
"output/validation-data/HTNPoRT-female-validation-data.csv"
)
data <- read.csv(data_file)
# View the input data structure
head(data)
# Path to model export file
model_export_file <- file.path(
htnport_dir,
"output/logistic-model-export/female/HTNPoRT-female-model-export.csv"
)
# Prepare the model pipeline
mod <- prepare_model_pipeline(model_export_file)
# Run the pipeline
predictions <- run_model_pipeline(mod, x = data)
# View the logistic predictions (hypertension risk probabilities)
head(predictions)
# Summary statistics of predictions
summary(predictions)
Next Steps
- For detailed information about the Model Parameters specification, see the Model Parameters Reference Documentation
- To add new transformation steps, see the Adding a New Transformation Step guide
- To report issues or request features, visit the issue tracker