Introduction

Overview

The Model Parameters Pipeline is a Python package that applies transformations to data according to the Model Parameters specification developed by Big Life Lab. It implements a pipeline of sequential data transformations commonly used in predictive health models.

This guide will walk you through:

  1. Understanding the Model Parameters specification

  2. Setting up your model configuration files

  3. Running the transformation pipeline

  4. Working with the results

What is the Model Parameters Specification?

The Model Parameters specification is a standardized way to define and apply data transformations used in predictive algorithms. It was developed by Big Life Lab for their predictive health models such as:

  • HTNPoRT: Hypertension Population Risk Tool

  • DemPoRT: Dementia Population Risk Tool

  • CVDPoRT: Cardiovascular Disease Population Risk Tool

  • MPoRT: Mortality Population Risk Tool

The specification uses CSV files to define transformations, making algorithms:

  • Transparent: All parameters and transformations are documented in human-readable files

  • Portable: The same model can be deployed across different platforms and programming languages

  • Reproducible: Transformations are applied consistently
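
Because the configuration is stored in ordinary CSV files, you can open a model export in a spreadsheet or read it with pandas before running anything. A minimal sketch (the path is illustrative):

import pandas as pd

# Inspect the model export file like any other CSV to see which
# transformation steps and parameters the model defines
model_export = pd.read_csv("path/to/model-export.csv")
print(model_export.head())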

Supported Transformations

The pipeline supports five types of transformations (illustrated conceptually in the sketch after this list):

  1. Center: Subtracts a constant value from variables (e.g., age - 50)

  2. Dummy: Creates binary indicator variables for categorical values

  3. Interaction: Multiplies variables together to create interaction terms

  4. RCS: Applies restricted cubic spline transformations for non-linear relationships

  5. Logistic Regression: Applies logistic regression to generate predictions
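
The sketch below is not the package's implementation; it only illustrates, in plain pandas/numpy terms, roughly what each transformation computes. Column names, knots, and coefficients are made up for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [42, 55, 63], "sex": ["male", "female", "female"]})

# 1. Center: subtract a constant (e.g., age - 50)
df["age_centered"] = df["age"] - 50

# 2. Dummy: binary indicator for a categorical value
df["sex_female"] = (df["sex"] == "female").astype(int)

# 3. Interaction: product of two variables
df["age_x_female"] = df["age_centered"] * df["sex_female"]

# 4. RCS: spline terms for non-linear effects. One truncated-cubic term is
#    shown for illustration; the real basis uses the knots defined in the
#    model parameter files
df["age_rcs1"] = np.clip(df["age"] - 50.0, 0, None) ** 3

# 5. Logistic regression: a linear predictor passed through the logistic
#    function (intercept and coefficients here are invented)
lp = -3.0 + 0.04 * df["age_centered"] + 0.3 * df["sex_female"]
df["predicted_risk"] = 1 / (1 + np.exp(-lp))

print(df)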

Installation

Prerequisites:

  • Python >= 3.10

  • pandas >= 2.0

  • numpy >= 1.24

# Install from GitHub (if published)
pip install git+https://github.com/Big-Life-Lab/model-parameters-pipeline-py.git

# Or install from local source
pip install /path/to/model-parameters-pipeline-py

Install in development mode:

pip install -e ".[dev]"
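
To check that the install worked, try importing the main class from the command line:

# Verify the installation by importing the main class
python -c "from model_parameters_pipeline import ModelPipeline; print('ok')"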

Basic Usage

The package uses a class-based workflow with three steps:

  1. ModelPipeline(...) – Load and validate model configuration files

  2. .run(...) – Apply transformations to data

  3. .get_output(...) – Extract results

from model_parameters_pipeline import ModelPipeline

# Step 1: Create the pipeline (loads and validates configuration)
pipeline = ModelPipeline("path/to/model-export.csv")

# Step 2: Run the pipeline on your data
pipeline.run(dat="path/to/input-data.csv")

# Step 3: Extract the output as a DataFrame
result = pipeline.get_output()

# View the first few rows
print(result.head())

Using DataFrames for Input Data

You can pass a pandas DataFrame instead of a file path for the input data:

import pandas as pd

# Create the pipeline
pipeline = ModelPipeline("path/to/model-export.csv")

# Load and preprocess your data
input_data = pd.read_csv("path/to/input-data.csv")

# Run pipeline with DataFrame
pipeline.run(dat=input_data)

# Extract the output
result = pipeline.get_output()

This is useful when your data is already loaded or needs preprocessing.
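
For instance, continuing the example above, you might drop incomplete rows or filter the cohort before calling run(). The column name below is hypothetical; use whatever predictors your model expects:

# Hypothetical preprocessing before running the pipeline
input_data = input_data.dropna()
input_data = input_data[input_data["age"] >= 20]  # "age" is illustrative

pipeline.run(dat=input_data)
result = pipeline.get_output()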

Processing Multiple Datasets

If you need to apply the same model to multiple datasets (e.g., processing batches), reuse the pipeline object for better performance:

# Create the pipeline once -- configuration files are loaded and cached
pipeline = ModelPipeline("path/to/model-export.csv")

# Run on multiple datasets using method chaining
result1 = pipeline.run(dat="batch1_data.csv").get_output()
result2 = pipeline.run(dat="batch2_data.csv").get_output()
result3 = pipeline.run(dat="batch3_data.csv").get_output()

This avoids re-reading and parsing the configuration files for each batch.
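
The same pattern extends naturally to a loop. Here is a sketch that processes a list of batch files with one pipeline object (file names are illustrative):

import pandas as pd

batch_files = ["batch1_data.csv", "batch2_data.csv", "batch3_data.csv"]

# Reuse the already-constructed pipeline for every batch
results = {path: pipeline.run(dat=path).get_output() for path in batch_files}

# Optionally stack the batch outputs into one DataFrame, keyed by source file
combined = pd.concat(results)
print(combined.head())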

Method Chaining

run() returns self, so you can chain calls:

pipeline = ModelPipeline("path/to/model-export.csv")
result = pipeline.run(dat="path/to/input-data.csv").get_output()

Restricting File Access with sandbox_path

Model configuration files can reference arbitrary paths on the filesystem, which is a concern when the pipeline runs on a server or any other public-facing system. Use the sandbox_path parameter to restrict which files the pipeline is allowed to read.

pipeline = ModelPipeline(
    "path/to/model-files/model-export.csv",
    sandbox_path="path/to/model-files/",
)

When sandbox_path is set, every file referenced inside the model configuration (the model export, variables file, model-steps file, and any step parameter files) must be located within that directory. If any path resolves outside of it, the constructor raises a ValueError.

This restriction applies to the model configuration files, including the model export file passed to the constructor; it does not affect data files passed to run().
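
A sketch of the failure mode described above, with illustrative paths: if the export (or any file it references) resolves outside sandbox_path, construction fails instead of reading the file.

from model_parameters_pipeline import ModelPipeline

try:
    pipeline = ModelPipeline(
        "/somewhere/else/model-export.csv",  # outside the sandbox
        sandbox_path="path/to/model-files/",
    )
except ValueError as err:
    # The constructor refuses to read configuration files outside sandbox_path
    print(f"Rejected model configuration: {err}")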

When to use it:

  • You expose the pipeline as a web service or API and the model export path (or paths inside it) could be influenced by user input.

  • You want to enforce that a model package stays self-contained within a specific directory and never reads files from elsewhere on the filesystem.

When you can omit it:

  • You are running the pipeline locally in a trusted environment where all model files are under your control and path traversal is not a concern. The default (sandbox_path=None) imposes no restriction.

Working with Results

run() applies transformations in place and returns self for method chaining. Use get_output() to extract a DataFrame. The mode argument (default "output") controls what columns are returned:

  • "output": only the columns produced by the final transformation step

  • "full": all columns – original predictors plus every intermediate and output column

pipeline.run(dat="path/to/input-data.csv")

# Default mode: only the final step's output columns
output = pipeline.get_output()

# Full mode: all columns including intermediate transformation variables
output_full = pipeline.get_output(mode="full")

# If the model includes a logistic-regression step, extract predictions
# Logistic predictions are stored in columns starting with "logistic_"
predictions = output_full.filter(regex=r"^logistic_")

# View column names to see what transformations were created
print(output_full.columns.tolist())

Real-World Example: HTNPoRT Model

The HTNPoRT (Hypertension Population Risk Tool) is a validated predictive model for hypertension risk. Here’s how to use this package with HTNPoRT:

# Clone the HTNPoRT repository to get model parameters and validation data
# In your terminal:
# git clone https://github.com/Big-Life-Lab/htnport.git

import pandas as pd
from pathlib import Path
from model_parameters_pipeline import ModelPipeline

# Set path to cloned HTNPoRT repository
htnport_dir = Path("/path/to/htnport")

# Load validation data
data_file = (
    htnport_dir
    / "output/validation-data/HTNPoRT-female-validation-data.csv"
)
data = pd.read_csv(data_file)

# View the input data structure
print(data.head())

# Path to model export file
model_export_file = (
    htnport_dir
    / "output/logistic-model-export/female/HTNPoRT-female-model-export.csv"
)

# Create and run the pipeline, extract full output
pipeline = ModelPipeline(model_export_file)
pipeline.run(dat=data)
result_full = pipeline.get_output(mode="full")

# View the transformed data with all intermediate steps
print(result_full.head())

# Extract logistic predictions (hypertension risk probabilities)
predictions = result_full.filter(regex=r"^logistic_")
print(predictions.head())

# Summary statistics of predictions
print(predictions.describe())

Next Steps

Additional Resources