Adding a New Transformation Step

This guide explains how to add support for a new transformation step to the Model Parameters Pipeline, as defined by the Model Parameters repository.

Overview

Adding a new transformation step requires three main tasks:

Register the step in the _STEP_DISPATCH dictionary in pipeline.py
Create a new step module – Implement run_step_{stepname} in src/model_parameters_pipeline/steps/{stepname}.py
Add unit tests – Create test files to verify correct behavior

Step 1: Register the Step in `_STEP_DISPATCH`

The pipeline.py module contains a _STEP_DISPATCH dictionary that maps step names to their implementation functions. The run method uses this dictionary to dispatch each step.

Location

Find the _STEP_DISPATCH dictionary near the top of pipeline.py:

_STEP_DISPATCH: dict[str, Callable[[ModelPipeline, str | Path], list[str]]] = {
    "center": run_step_center,
    "dummy": run_step_dummy,
    "interaction": run_step_interaction,
    "logistic-regression": run_step_logistic_regression,
    "rcs": run_step_rcs,
}

Add Your Step

Import your new step function at the top of the file:

from model_parameters_pipeline.steps.your_step_name import run_step_your_step_name

Add a new entry to _STEP_DISPATCH:

_STEP_DISPATCH: dict[str, Callable[[ModelPipeline, str | Path], list[str]]] = {
    "center": run_step_center,
    "dummy": run_step_dummy,
    "interaction": run_step_interaction,
    "logistic-regression": run_step_logistic_regression,
    "rcs": run_step_rcs,
    "your-step-name": run_step_your_step_name,
}

Important

The dictionary key ("your-step-name") must match the exact step name as it appears in the Model Parameters specification and in users’ model-steps.csv files.
The function name (run_step_your_step_name) uses underscores instead of hyphens.

Step 2: Create the Step Function

Create a new module src/model_parameters_pipeline/steps/{stepname}.py containing a function named run_step_{stepname} that implements the transformation logic.

Create the Module

Create a new file at src/model_parameters_pipeline/steps/{stepname}.py (replace {stepname} with your step name, using underscores).

Function Template

Use this template as a starting point:

"""{Step Name} transformation step.

{Brief description of what this step does and its purpose}.
"""

from __future__ import annotations

from pathlib import Path
from typing import TYPE_CHECKING

from model_parameters_pipeline._utils import verify_columns

if TYPE_CHECKING:
    from model_parameters_pipeline.pipeline import ModelPipeline


def run_step_{stepname}(mod: ModelPipeline, file: str | Path) -> list[str]:
    """Run {step name} transformation step.

    Args:
        mod: ModelPipeline instance (mutated in place).
        file: Path to {step name} step specification CSV.

    Returns:
        List of output column names created by this step.
    """
    mod._add_file(file)
    step_data = mod._get_file(file)

    verify_columns(
        step_data,
        ["column1", "column2", "column3"],
        "{stepname} step file",
        file,
    )

    output_columns: list[str] = []

    for _, row in step_data.iterrows():
        # Extract parameters from the specification
        param1 = row["column1"]
        param2 = row["column2"]
        param3 = row["column3"]

        # Implement your transformation logic here
        # Example: mod.data[new_column] = mod.data[existing_column] * param
        output_columns.append(new_column)

    return output_columns

Key Components Explained

File Location: Create your step function in src/model_parameters_pipeline/steps/{stepname}.py.
Function Signature:
- Always takes mod (ModelPipeline instance) and file (path to specification file)
- Input data is accessed and modified via mod.data (a pandas DataFrame)
- Always returns a list[str] of output column names
- The ModelPipeline import is under TYPE_CHECKING to avoid circular imports
Load Specification File:
```
mod._add_file(file)
step_data = mod._get_file(file)
```
Always use these ModelPipeline methods to load files – never read files directly (e.g. with pd.read_csv). _add_file and _get_file ensure that any file path is validated against the sandbox path, if one was specified when constructing the ModelPipeline. This prevents steps from reading files outside the permitted directory.
Verify Columns:
```
verify_columns(
    step_data,
    ["column1", "column2", "column3"],
    "{stepname} step file",
    file,
)
```
Validates that the specification file contains all required columns. Update the column list to match your step’s requirements from the Model Parameters documentation. Import verify_columns from model_parameters_pipeline._utils.
Process Each Row: The for _, row in step_data.iterrows() loop processes each row in the specification file. Each row typically defines one transformation to apply.
Access and Write Data:
- Read a column: mod.data[column_name] (returns a pandas Series)
- Write a column: mod.data[new_column] = transformed_values
Track Output Columns: Append each new column name to output_columns so the pipeline knows which columns this step produced.

Example: Center Step

Here’s a real example from the existing codebase (steps/center.py):

def run_step_center(mod: ModelPipeline, file: str | Path) -> list[str]:
    """Run center transformation step.

    Args:
        mod: ModelPipeline instance (mutated in place).
        file: Path to center step specification CSV.

    Returns:
        List of output column names created by this step.
    """
    mod._add_file(file)
    step_data = mod._get_file(file)

    verify_columns(
        step_data,
        ["origVariable", "centerValue", "centeredVariable"],
        "center step file",
        file,
    )

    output_columns: list[str] = []

    for _, row in step_data.iterrows():
        orig_variable = row["origVariable"]
        center_value = row["centerValue"]
        centered_variable = row["centeredVariable"]

        mod.data[centered_variable] = mod.data[orig_variable] - center_value
        output_columns.append(centered_variable)

    return output_columns

This function:

Is defined in its own module steps/center.py
Accepts mod (a ModelPipeline instance) and file; data is accessed via mod.data
Loads the center specification file using mod._add_file / mod._get_file
Verifies it has the required columns (origVariable, centerValue, centeredVariable)
For each row, creates a new centered variable by subtracting centerValue from the original variable in mod.data
Returns a list of the new column names

Step 3: Add Unit Tests

Unit tests ensure your transformation step works correctly. The testing framework automatically discovers and runs tests based on directory structure.

See Model Parameters Step Tests for complete instructions.

Reference Documentation

For detailed information about Model Parameters transformation steps and their required file formats, see:

Checklist

Use this checklist when adding a new transformation step:

Create new module: src/model_parameters_pipeline/steps/{stepname}.py
Implement run_step_{stepname} function with type hints and docstring
Import and add entry to _STEP_DISPATCH in pipeline.py
Verify column names match the Model Parameters specification
Create test directory: tests/testdata/step-tests/test-{stepname}/
Create test-model-export.csv in test directory
Create test-model-steps.csv in test directory
Create test-{stepname}.csv with test parameters in test directory
Generate expected output using generate_step_tests_expected()
Run pytest to verify tests pass
Review and commit all changes including test-expected.csv

Common Patterns

Parsing Delimited Strings

Some steps use delimited strings (e.g., “var1;var2;var3”) in their parameters file. Use the helper function:

from model_parameters_pipeline._utils import get_string_parts

parts = get_string_parts(row["columnName"])

Working with Numeric Values

Convert string values to numeric when needed:

from model_parameters_pipeline._utils import get_string_parts

numeric_values = [float(v) for v in get_string_parts(row["knots"])]

Creating New Columns Safely

To avoid column name conflicts:

from model_parameters_pipeline._utils import get_unused_column

new_col = get_unused_column(mod.data, "prefix_")

Adding Multiple Columns

You can add multiple columns at once using numpy array assignment:

import numpy as np

# vals is a 2D numpy array with one column per variable
vals = some_computation(mod.data[variable])
for col_idx, var_name in enumerate(variable_names):
    mod.data[var_name] = vals[:, col_idx]

Getting Help

For Model Parameters specification questions, refer to the Model Parameters documentation
For existing step implementation examples, see modules like steps/center.py, steps/dummy.py, etc.
For testing questions, see tests/testdata/step-tests/README.md