Introduction

Overview

The Model Parameters Pipeline is a Python package that applies transformations to data according to the Model Parameters specification developed by Big Life Lab. It implements a pipeline of sequential data transformations commonly used in predictive health models.

This guide will walk you through:

  1. Understanding the Model Parameters specification

  2. Setting up your model configuration files

  3. Running the transformation pipeline

  4. Working with the results

What is the Model Parameters Specification?

The Model Parameters specification is a standardized way to define and apply data transformations used in predictive algorithms. It was developed by Big Life Lab for their predictive health models such as:

  • HTNPoRT: Hypertension Population Risk Tool

  • DemPoRT: Dementia Population Risk Tool

  • CVDPoRT: Cardiovascular Disease Population Risk Tool

  • MPoRT: Mortality Population Risk Tool

The specification uses CSV files to define transformations, making algorithms:

  • Transparent: All parameters and transformations are documented in human-readable files

  • Portable: The same model can be deployed across different platforms and programming languages

  • Reproducible: Transformations are applied consistently
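
Because the configuration is stored in ordinary CSV files, you can open a model export in a spreadsheet or read it with pandas before running anything. A minimal sketch (the path is illustrative):

import pandas as pd

# Inspect the model export file like any other CSV to see which
# transformation steps and parameters the model defines
model_export = pd.read_csv("path/to/model-export.csv")
print(model_export.head())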

Supported Transformations

The pipeline supports five types of transformations (illustrated conceptually in the sketch after this list):

  1. Center: Subtracts a constant value from variables (e.g., age - 50)

  2. Dummy: Creates binary indicator variables for categorical values

  3. Interaction: Multiplies variables together to create interaction terms

  4. RCS: Applies restricted cubic spline transformations for non-linear relationships

  5. Logistic Regression: Applies logistic regression to generate predictions
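
The sketch below is not the package's implementation; it only illustrates, in plain pandas/numpy terms, roughly what each transformation computes. Column names, knots, and coefficients are made up for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [42, 55, 63], "sex": ["male", "female", "female"]})

# 1. Center: subtract a constant (e.g., age - 50)
df["age_centered"] = df["age"] - 50

# 2. Dummy: binary indicator for a categorical value
df["sex_female"] = (df["sex"] == "female").astype(int)

# 3. Interaction: product of two variables
df["age_x_female"] = df["age_centered"] * df["sex_female"]

# 4. RCS: spline terms for non-linear effects. One truncated-cubic term is
#    shown for illustration; the real basis uses the knots defined in the
#    model parameter files
df["age_rcs1"] = np.clip(df["age"] - 50.0, 0, None) ** 3

# 5. Logistic regression: a linear predictor passed through the logistic
#    function (intercept and coefficients here are invented)
lp = -3.0 + 0.04 * df["age_centered"] + 0.3 * df["sex_female"]
df["predicted_risk"] = 1 / (1 + np.exp(-lp))

print(df)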

Installation

Prerequisites:

  • Python >= 3.10

  • pandas >= 2.0

  • numpy >= 1.24

# Install from GitHub (if published)
pip install git+https://github.com/Big-Life-Lab/model-parameters-pipeline-py.git

# Or install from local source
pip install /path/to/model-parameters-pipeline-py

Install in development mode:

pip install -e ".[dev]"
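
To check that the install worked, try importing the main class from the command line:

# Verify the installation by importing the main class
python -c "from model_parameters_pipeline import ModelPipeline; print('ok')"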

Basic Usage

The package uses a class-based workflow with three steps:

  1. ModelPipeline(...) – Load and validate model configuration files

  2. .run(...) – Apply transformations to data

  3. .get_output(...) – Extract results

from model_parameters_pipeline import ModelPipeline

# Step 1: Create the pipeline (loads and validates configuration)
pipeline = ModelPipeline("path/to/model-export.csv")

# Step 2: Run the pipeline on your data
pipeline.run(dat="path/to/input-data.csv")

# Step 3: Extract the output as a DataFrame
result = pipeline.get_output()

# View the first few rows
print(result.head())

Using DataFrames for Input Data

You can pass a pandas DataFrame instead of a file path for the input data:

import pandas as pd

# Create the pipeline
pipeline = ModelPipeline("path/to/model-export.csv")

# Load and preprocess your data
input_data = pd.read_csv("path/to/input-data.csv")

# Run pipeline with DataFrame
pipeline.run(dat=input_data)

# Extract the output
result = pipeline.get_output()

This is useful when your data is already loaded or needs preprocessing.
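
For instance, continuing the example above, you might drop incomplete rows or filter the cohort before calling run(). The column name below is hypothetical; use whatever predictors your model expects:

# Hypothetical preprocessing before running the pipeline
input_data = input_data.dropna()
input_data = input_data[input_data["age"] >= 20]  # "age" is illustrative

pipeline.run(dat=input_data)
result = pipeline.get_output()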

Processing Multiple Datasets

If you need to apply the same model to multiple datasets (e.g., processing batches), reuse the pipeline object for better performance:

# Create the pipeline once -- configuration files are loaded and cached
pipeline = ModelPipeline("path/to/model-export.csv")

# Run on multiple datasets using method chaining
result1 = pipeline.run(dat="batch1_data.csv").get_output()
result2 = pipeline.run(dat="batch2_data.csv").get_output()
result3 = pipeline.run(dat="batch3_data.csv").get_output()

This avoids re-reading and parsing the configuration files for each batch.
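
The same pattern extends naturally to a loop. Here is a sketch that processes a list of batch files with one pipeline object (file names are illustrative):

import pandas as pd

batch_files = ["batch1_data.csv", "batch2_data.csv", "batch3_data.csv"]

# Reuse the already-constructed pipeline for every batch
results = {path: pipeline.run(dat=path).get_output() for path in batch_files}

# Optionally stack the batch outputs into one DataFrame, keyed by source file
combined = pd.concat(results)
print(combined.head())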

Method Chaining

run() returns self, so you can chain calls:

pipeline = ModelPipeline("path/to/model-export.csv")
result = pipeline.run(dat="path/to/input-data.csv").get_output()

Restricting File Access with sandbox_path

Model configuration files can reference arbitrary paths on the filesystem, which is a concern when the pipeline runs on a server or any other public-facing system. Use the sandbox_path parameter to restrict which files the pipeline is allowed to read.

pipeline = ModelPipeline(
    "path/to/model-files/model-export.csv",
    sandbox_path="path/to/model-files/",
)

When sandbox_path is set, every file referenced inside the model configuration (the model export, variables file, model-steps file, and any step parameter files) must be located within that directory. If any path resolves outside of it, the constructor raises a ValueError.

This restriction applies to the model configuration files, including the model export file passed to the constructor; it does not affect data files passed to run().
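
A sketch of the failure mode described above, with illustrative paths: if the export (or any file it references) resolves outside sandbox_path, construction fails instead of reading the file.

from model_parameters_pipeline import ModelPipeline

try:
    pipeline = ModelPipeline(
        "/somewhere/else/model-export.csv",  # outside the sandbox
        sandbox_path="path/to/model-files/",
    )
except ValueError as err:
    # The constructor refuses to read configuration files outside sandbox_path
    print(f"Rejected model configuration: {err}")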

When to use it:

  • You expose the pipeline as a web service or API and the model export path (or paths inside it) could be influenced by user input.

  • You want to enforce that a model package stays self-contained within a specific directory and never reads files from elsewhere on the filesystem.

When you can omit it:

  • You are running the pipeline locally in a trusted environment where all model files are under your control and path traversal is not a concern. The default (sandbox_path=None) imposes no restriction.

Working with Results

run() applies transformations in place and returns self for method chaining. Use get_output() to extract a DataFrame. The mode argument (default "output") controls what columns are returned:

  • "output": only the columns produced by the final transformation step

  • "full": all columns – original predictors plus every intermediate and output column

pipeline.run(dat="path/to/input-data.csv")

# Default mode: only the final step's output columns
output = pipeline.get_output()

# Full mode: all columns including intermediate transformation variables
output_full = pipeline.get_output(mode="full")

# If the model includes a logistic-regression step, extract predictions
# Logistic predictions are stored in columns starting with "logistic_"
predictions = output_full.filter(regex=r"^logistic_")

# View column names to see what transformations were created
print(output_full.columns.tolist())

Real-World Example: HTNPoRT Model

The HTNPoRT (Hypertension Population Risk Tool) is a validated predictive model for hypertension risk. Here’s how to use this package with HTNPoRT:

# Clone the HTNPoRT repository to get model parameters and validation data
# In your terminal:
# git clone https://github.com/Big-Life-Lab/htnport.git

import pandas as pd
from pathlib import Path
from model_parameters_pipeline import ModelPipeline

# Set path to cloned HTNPoRT repository
htnport_dir = Path("/path/to/htnport")

# Load validation data
data_file = (
    htnport_dir
    / "output/validation-data/HTNPoRT-female-validation-data.csv"
)
data = pd.read_csv(data_file)

# View the input data structure
print(data.head())

# Path to model export file
model_export_file = (
    htnport_dir
    / "output/logistic-model-export/female/HTNPoRT-female-model-export.csv"
)

# Create and run the pipeline, extract full output
pipeline = ModelPipeline(model_export_file)
pipeline.run(dat=data)
result_full = pipeline.get_output(mode="full")

# View the transformed data with all intermediate steps
print(result_full.head())

# Extract logistic predictions (hypertension risk probabilities)
predictions = result_full.filter(regex=r"^logistic_")
print(predictions.head())

# Summary statistics of predictions
print(predictions.describe())

Next Steps

Additional Resources