Introduction
Overview
The Model Parameters Pipeline is a Python package that applies transformations to data according to the Model Parameters specification developed by Big Life Lab. It implements a pipeline of sequential data transformations commonly used in predictive health models.
This guide will walk you through:
Understanding the Model Parameters specification
Setting up your model configuration files
Running the transformation pipeline
Working with the results
What is the Model Parameters Specification?
The Model Parameters specification is a standardized way to define and apply data transformations used in predictive algorithms. It was developed by Big Life Lab for their predictive health models such as:
HTNPoRT: Hypertension Population Risk Tool
DemPoRT: Dementia Population Risk Tool
CVDPoRT: Cardiovascular Disease Population Risk Tool
MPoRT: Mortality Population Risk Tool
The specification uses CSV files to define transformations (a hypothetical sketch follows this list), making algorithms:
Transparent: All parameters and transformations are documented in human-readable files
Portable: The same model can be deployed across different platforms and programming languages
Reproducible: Transformations are applied consistently
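To make the idea concrete, a row in such a CSV might read as shown below. This layout is purely illustrative and is not the real schema; the actual column names and structure are defined in the Model Parameters Reference Documentation:
variable,transformation,parameter
age,center,50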
Supported Transformations
The pipeline supports five types of transformations, with a pandas sketch of the arithmetic after this list:
Center: Subtracts a constant value from variables (e.g., age - 50)
Dummy: Creates binary indicator variables for categorical values
Interaction: Multiplies variables together to create interaction terms
RCS: Applies restricted cubic spline transformations for non-linear relationships
Logistic Regression: Applies logistic regression to generate predictions
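The first three transformations are easiest to picture directly in pandas. The sketch below is illustrative only -- it is not the package API, just the arithmetic each step performs; the pipeline carries these operations out for you based on the CSV configuration:
import pandas as pd
# Illustration only: column and variable names here are made up.
df = pd.DataFrame({"age": [42, 67, 55], "smoker": ["yes", "no", "yes"]})
# Center: subtract a constant (e.g., age - 50)
df["age_centered"] = df["age"] - 50
# Dummy: binary indicator for a categorical value
df["smoker_yes"] = (df["smoker"] == "yes").astype(int)
# Interaction: product of two variables
df["age_x_smoker_yes"] = df["age_centered"] * df["smoker_yes"]
print(df)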
Installation
Prerequisites:
Python >= 3.10
pandas >= 2.0
numpy >= 1.24
# Install from GitHub (if published)
pip install git+https://github.com/Big-Life-Lab/model-parameters-pipeline-py.git
# Or install from local source
pip install /path/to/model-parameters-pipeline-py
Install in development mode:
pip install -e ".[dev]"
Basic Usage
The package uses a class-based workflow with three steps:
ModelPipeline(...) – Load and validate model configuration files
run(...) – Apply transformations to data
get_output(...) – Extract results
from model_parameters_pipeline import ModelPipeline
# Step 1: Create the pipeline (loads and validates configuration)
pipeline = ModelPipeline("path/to/model-export.csv")
# Step 2: Run the pipeline on your data
pipeline.run(dat="path/to/input-data.csv")
# Step 3: Extract the output as a DataFrame
result = pipeline.get_output()
# View the first few rows
print(result.head())
Using DataFrames for Input Data
You can pass a pandas DataFrame instead of a file path for the input data:
import pandas as pd
# Create the pipeline
pipeline = ModelPipeline("path/to/model-export.csv")
# Load and preprocess your data
input_data = pd.read_csv("path/to/input-data.csv")
# Run pipeline with DataFrame
pipeline.run(dat=input_data)
# Extract the output
result = pipeline.get_output()
This is useful when your data is already loaded or needs preprocessing.
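For instance, a minimal preprocessing step before run() might drop rows with missing values. In this sketch the "age" column is hypothetical:
# Hypothetical preprocessing: drop rows missing a value in the
# (made-up) "age" column before running the pipeline
input_data = input_data.dropna(subset=["age"])
pipeline.run(dat=input_data)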
Processing Multiple Datasets
If you need to apply the same model to multiple datasets (e.g., processing batches), reuse the pipeline object for better performance:
# Create the pipeline once -- configuration files are loaded and cached
pipeline = ModelPipeline("path/to/model-export.csv")
# Run on multiple datasets using method chaining
result1 = pipeline.run(dat="batch1_data.csv").get_output()
result2 = pipeline.run(dat="batch2_data.csv").get_output()
result3 = pipeline.run(dat="batch3_data.csv").get_output()
This avoids re-reading and parsing the configuration files for each batch.
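The same pattern extends naturally to a loop. In the sketch below, the directory and file-name pattern are assumptions, not part of the package:
from pathlib import Path
# Illustrative batch loop: one pipeline object, many datasets
results = {}
for data_file in sorted(Path("batches").glob("batch*_data.csv")):
    results[data_file.stem] = pipeline.run(dat=str(data_file)).get_output()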
Method Chaining
run() returns self, so you can chain calls:
pipeline = ModelPipeline("path/to/model-export.csv")
result = pipeline.run(dat="path/to/input-data.csv").get_output()
Restricting File Access with sandbox_path
When running on a server or any public-facing system, the model configuration
files can reference arbitrary paths on the filesystem. Use the sandbox_path
parameter to restrict which files the pipeline is allowed to read.
pipeline = ModelPipeline(
"path/to/model-files/model-export.csv",
sandbox_path="path/to/model-files/",
)
When sandbox_path is set, every file referenced inside the model
configuration (the model export, variables file, model-steps file, and any
step parameter files) must be located within that directory. If any path
resolves outside of it, the constructor raises a ValueError.
This restriction applies only to the model configuration files – it does
not affect data files passed to run(). Note that the model export file
passed to the constructor is itself a configuration file, so it must also
lie within the sandbox.
When to use it:
You expose the pipeline as a web service or API and the model export path (or paths inside it) could be influenced by user input.
You want to enforce that a model package stays self-contained within a specific directory and never reads files from elsewhere on the filesystem.
When you can omit it:
You are running the pipeline locally in a trusted environment where all model files are under your control and path traversal is not a concern. The default (sandbox_path=None) imposes no restriction.
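Because the constructor raises a ValueError when a referenced path escapes the sandbox, a service can catch the error and reject the offending model package. A minimal sketch -- only the ValueError behaviour is documented above; the surrounding handling is illustrative:
# Illustrative handling of the documented ValueError
try:
    pipeline = ModelPipeline(
        "path/to/model-files/model-export.csv",
        sandbox_path="path/to/model-files/",
    )
except ValueError as err:
    # A file referenced by the model configuration resolved
    # outside sandbox_path
    print(f"Rejected model configuration: {err}")
    raise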
Working with Results
run() applies transformations in place and returns self for method
chaining. Use get_output() to extract a DataFrame. The mode argument
(default "output") controls what columns are returned:
"output": only the columns produced by the final transformation step"full": all columns – original predictors plus every intermediate and output column
pipeline.run(dat="path/to/input-data.csv")
# Default mode: only the final step's output columns
output = pipeline.get_output()
# Full mode: all columns including intermediate transformation variables
output_full = pipeline.get_output(mode="full")
# If the model includes a logistic-regression step, extract predictions
# Logistic predictions are stored in columns starting with "logistic_"
predictions = output_full.filter(regex=r"^logistic_")
# View column names to see what transformations were created
print(output_full.columns.tolist())
Real-World Example: HTNPoRT Model
The HTNPoRT (Hypertension Population Risk Tool) is a validated predictive model for hypertension risk. Here’s how to use this package with HTNPoRT:
# Clone the HTNPoRT repository to get model parameters and validation data
# In your terminal:
# git clone https://github.com/Big-Life-Lab/htnport.git
import pandas as pd
from pathlib import Path
from model_parameters_pipeline import ModelPipeline
# Set path to cloned HTNPoRT repository
htnport_dir = Path("/path/to/htnport")
# Load validation data
data_file = (
htnport_dir
/ "output/validation-data/HTNPoRT-female-validation-data.csv"
)
data = pd.read_csv(data_file)
# View the input data structure
print(data.head())
# Path to model export file
model_export_file = (
htnport_dir
/ "output/logistic-model-export/female/HTNPoRT-female-model-export.csv"
)
# Create and run the pipeline, extract full output
pipeline = ModelPipeline(model_export_file)
pipeline.run(dat=data)
result_full = pipeline.get_output(mode="full")
# View the transformed data with all intermediate steps
print(result_full.head())
# Extract logistic predictions (hypertension risk probabilities)
predictions = result_full.filter(regex=r"^logistic_")
print(predictions.head())
# Summary statistics of predictions
print(predictions.describe())
Next Steps
For detailed information about the Model Parameters specification, see the Model Parameters Reference Documentation
To add new transformation steps, see the Adding a New Transformation Step guide
To report issues or request features, visit the issue tracker