Introduction
============

Overview
--------

The Model Parameters Pipeline is a Python package for applying
transformations to data according to the `Model Parameters `_
specification developed by Big Life Lab. This package implements a
pipeline for sequential data transformations that are commonly used in
predictive health models.

This guide will walk you through:

1. Understanding the Model Parameters specification
2. Setting up your model configuration files
3. Running the transformation pipeline
4. Working with the results

What is the Model Parameters Specification?
--------------------------------------------

The Model Parameters specification is a standardized way to define and
apply data transformations used in predictive algorithms. It was
developed by Big Life Lab for their predictive health models, such as:

- **HTNPoRT**: Hypertension Population Risk Tool
- **DemPoRT**: Dementia Population Risk Tool
- **CVDPoRT**: Cardiovascular Disease Population Risk Tool
- **MPoRT**: Mortality Population Risk Tool

The specification uses CSV files to define transformations, making
algorithms:

- **Transparent**: All parameters and transformations are documented in
  human-readable files
- **Portable**: The same model can be deployed across different
  platforms and programming languages
- **Reproducible**: Transformations are applied consistently

Supported Transformations
-------------------------

The pipeline supports five types of transformations:

1. **Center**: Subtracts a constant value from a variable (e.g., age - 50)
2. **Dummy**: Creates binary indicator variables for categorical values
3. **Interaction**: Multiplies variables together to create interaction terms
4. **RCS**: Applies restricted cubic spline transformations for non-linear relationships
5. **Logistic Regression**: Applies logistic regression to generate predictions

Installation
------------

Prerequisites:

- Python >= 3.10
- pandas >= 2.0
- numpy >= 1.24

.. code-block:: bash

   # Install from GitHub (if published)
   pip install git+https://github.com/Big-Life-Lab/model-parameters-pipeline-py.git

   # Or install from local source
   pip install /path/to/model-parameters-pipeline-py

Install in development mode::

   pip install -e ".[dev]"

Basic Usage
-----------

The package uses a class-based workflow with three steps:

1. ``ModelPipeline(...)`` -- load and validate the model configuration files
2. ``.run(...)`` -- apply transformations to data
3. ``.get_output(...)`` -- extract the results

.. code-block:: python

   from model_parameters_pipeline import ModelPipeline

   # Step 1: Create the pipeline (loads and validates configuration)
   pipeline = ModelPipeline("path/to/model-export.csv")

   # Step 2: Run the pipeline on your data
   pipeline.run(dat="path/to/input-data.csv")

   # Step 3: Extract the output as a DataFrame
   result = pipeline.get_output()

   # View the first few rows
   print(result.head())

Using DataFrames for Input Data
-------------------------------

You can pass a pandas DataFrame instead of a file path for the input
data:

.. code-block:: python

   import pandas as pd

   from model_parameters_pipeline import ModelPipeline

   # Create the pipeline
   pipeline = ModelPipeline("path/to/model-export.csv")

   # Load and preprocess your data
   input_data = pd.read_csv("path/to/input-data.csv")

   # Run the pipeline with a DataFrame
   pipeline.run(dat=input_data)

   # Extract the output
   result = pipeline.get_output()

This is useful when your data is already loaded or needs preprocessing
before it reaches the pipeline, as in the sketch below.
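For example, you might drop incomplete records or restrict the data to
the population the model was built for before calling ``run()``. The
following is an illustrative sketch only: the column names (``age``,
``sex``) and the filtering rules are hypothetical and not part of the
package or of any particular model.

.. code-block:: python

   import pandas as pd

   from model_parameters_pipeline import ModelPipeline

   pipeline = ModelPipeline("path/to/model-export.csv")

   input_data = pd.read_csv("path/to/input-data.csv")

   # Hypothetical preprocessing -- adapt to your own data:
   # drop rows with missing predictors, then restrict the age range.
   input_data = input_data.dropna(subset=["age", "sex"])
   input_data = input_data[input_data["age"].between(20, 79)]

   pipeline.run(dat=input_data)
   result = pipeline.get_output()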
Processing Multiple Datasets
----------------------------

If you need to apply the same model to multiple datasets (e.g.,
processing batches), reuse the pipeline object for better performance:

.. code-block:: python

   # Create the pipeline once -- configuration files are loaded and cached
   pipeline = ModelPipeline("path/to/model-export.csv")

   # Run on multiple datasets using method chaining
   result1 = pipeline.run(dat="batch1_data.csv").get_output()
   result2 = pipeline.run(dat="batch2_data.csv").get_output()
   result3 = pipeline.run(dat="batch3_data.csv").get_output()

This avoids re-reading and parsing the configuration files for each
batch.

Method Chaining
---------------

``run()`` returns ``self``, so you can chain calls:

.. code-block:: python

   pipeline = ModelPipeline("path/to/model-export.csv")
   result = pipeline.run(dat="path/to/input-data.csv").get_output()

Restricting File Access with ``sandbox_path``
---------------------------------------------

When running on a server or any public-facing system, the model
configuration files can reference arbitrary paths on the filesystem.
Use the ``sandbox_path`` parameter to restrict which files the pipeline
is allowed to read.

.. code-block:: python

   pipeline = ModelPipeline(
       "path/to/model-files/model-export.csv",
       sandbox_path="path/to/model-files/",
   )

When ``sandbox_path`` is set, every file referenced inside the model
configuration (the model export, variables file, model-steps file, and
any step parameter files) must be located within that directory. If any
path resolves outside of it, the constructor raises a ``ValueError``.
This restriction applies only to the **model configuration files** --
it does not affect data files passed to ``run()``. It does, however,
affect the model export file passed to the constructor.

**When to use it:**

- You expose the pipeline as a web service or API and the model export
  path (or paths inside it) could be influenced by user input.
- You want to enforce that a model package stays self-contained within
  a specific directory and never reads files from elsewhere on the
  filesystem.

**When you can omit it:**

- You are running the pipeline locally in a trusted environment where
  all model files are under your control and path traversal is not a
  concern. The default (``sandbox_path=None``) imposes no restriction.

Working with Results
--------------------

``run()`` applies transformations in place and returns ``self`` for
method chaining. Use ``get_output()`` to extract a DataFrame. The
``mode`` argument (default ``"output"``) controls which columns are
returned:

- ``"output"``: only the columns produced by the final transformation
  step
- ``"full"``: all columns -- original predictors plus every
  intermediate and output column

.. code-block:: python

   pipeline.run(dat="path/to/input-data.csv")

   # Default mode: only the final step's output columns
   output = pipeline.get_output()

   # Full mode: all columns including intermediate transformation variables
   output_full = pipeline.get_output(mode="full")

   # If the model includes a logistic-regression step, extract predictions.
   # Logistic predictions are stored in columns starting with "logistic_"
   predictions = output_full.filter(regex=r"^logistic_")

   # View column names to see what transformations were created
   print(output_full.columns.tolist())
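You will often want the model output next to the original records. The
following is a minimal sketch, not documented package behaviour: it
assumes ``get_output()`` returns one row per input row, in input order,
so the two frames can be aligned by position.

.. code-block:: python

   import pandas as pd

   from model_parameters_pipeline import ModelPipeline

   pipeline = ModelPipeline("path/to/model-export.csv")
   input_data = pd.read_csv("path/to/input-data.csv")

   output = pipeline.run(dat=input_data).get_output()

   # Align the output with the source rows by position. This assumes a
   # one-to-one, order-preserving mapping between input and output rows
   # -- verify that this holds for your model before relying on it.
   combined = pd.concat(
       [input_data.reset_index(drop=True), output.reset_index(drop=True)],
       axis=1,
   )
   print(combined.head())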
Real-World Example: HTNPoRT Model
---------------------------------

The HTNPoRT (Hypertension Population Risk Tool) is a validated
predictive model for hypertension risk. Here's how to use this package
with HTNPoRT:

.. code-block:: python

   # Clone the HTNPoRT repository to get model parameters and validation data
   # In your terminal:
   #   git clone https://github.com/Big-Life-Lab/htnport.git

   import pandas as pd
   from pathlib import Path

   from model_parameters_pipeline import ModelPipeline

   # Set path to cloned HTNPoRT repository
   htnport_dir = Path("/path/to/htnport")

   # Load validation data
   data_file = (
       htnport_dir
       / "output/validation-data/HTNPoRT-female-validation-data.csv"
   )
   data = pd.read_csv(data_file)

   # View the input data structure
   print(data.head())

   # Path to model export file
   model_export_file = (
       htnport_dir
       / "output/logistic-model-export/female/HTNPoRT-female-model-export.csv"
   )

   # Create and run the pipeline, then extract the full output
   pipeline = ModelPipeline(model_export_file)
   pipeline.run(dat=data)
   result_full = pipeline.get_output(mode="full")

   # View the transformed data with all intermediate steps
   print(result_full.head())

   # Extract logistic predictions (hypertension risk probabilities)
   predictions = result_full.filter(regex=r"^logistic_")
   print(predictions.head())

   # Summary statistics of predictions
   print(predictions.describe())

Next Steps
----------

- For detailed information about the Model Parameters specification,
  see the `Model Parameters Reference Documentation `_
- To add new transformation steps, see the `Adding a New Transformation
  Step `_ guide
- To report issues or request features, visit the `issue tracker `_

Additional Resources
--------------------

- `Big Life Lab GitHub `_
- `Model Parameters Specification `_
- `HTNPoRT Model `_
- `Predictive Algorithms Repository `_