Adding a New Transformation Step

This guide explains how to add support for a new transformation step to the Model Parameters Pipeline, as defined by the Model Parameters repository.

Overview

Adding a new transformation step requires three main tasks:

  1. Register the step – Add an entry to the _STEP_DISPATCH dictionary in pipeline.py

  2. Create a new step module – Implement run_step_{stepname} in src/model_parameters_pipeline/steps/{stepname}.py

  3. Add unit tests – Create test files to verify correct behavior

Step 1: Register the Step in _STEP_DISPATCH

The pipeline.py module contains a _STEP_DISPATCH dictionary that maps step names to their implementation functions. The run method uses this dictionary to dispatch each step.

Location

Find the _STEP_DISPATCH dictionary near the top of pipeline.py:

_STEP_DISPATCH: dict[str, Callable[[ModelPipeline, str | Path], list[str]]] = {
    "center": run_step_center,
    "dummy": run_step_dummy,
    "interaction": run_step_interaction,
    "logistic-regression": run_step_logistic_regression,
    "rcs": run_step_rcs,
}

Add Your Step

  1. Import your new step function at the top of the file:

    from model_parameters_pipeline.steps.your_step_name import run_step_your_step_name
    
  2. Add a new entry to _STEP_DISPATCH:

    _STEP_DISPATCH: dict[str, Callable[[ModelPipeline, str | Path], list[str]]] = {
        "center": run_step_center,
        "dummy": run_step_dummy,
        "interaction": run_step_interaction,
        "logistic-regression": run_step_logistic_regression,
        "rcs": run_step_rcs,
        "your-step-name": run_step_your_step_name,
    }
    

Important

  • The dictionary key ("your-step-name") must match the exact step name as it appears in the Model Parameters specification and in users’ model-steps.csv files.

  • The function name (run_step_your_step_name) uses underscores instead of hyphens.
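The dispatch mechanism described above can be sketched in isolation. This is a minimal illustration, not the code in pipeline.py: the two step functions are stubs, the returned column names are invented, plain dict and str stand in for ModelPipeline and the file path, and the dispatch helper with its error message is hypothetical.

```python
from __future__ import annotations

from typing import Callable


# Hypothetical stubs standing in for real step functions. Each returns the
# names of the columns it would create.
def run_step_center(mod: dict, file: str) -> list[str]:
    return ["age_centered"]


def run_step_logistic_regression(mod: dict, file: str) -> list[str]:
    return ["predicted_risk"]


# Keys are the hyphenated step names as written in model-steps.csv.
# Function names use underscores because hyphens are not valid in Python
# identifiers.
STEP_DISPATCH: dict[str, Callable[[dict, str], list[str]]] = {
    "center": run_step_center,
    "logistic-regression": run_step_logistic_regression,
}


def dispatch(step_name: str, mod: dict, file: str) -> list[str]:
    """Look up a step by name and run it, failing clearly if it is unknown."""
    try:
        step_function = STEP_DISPATCH[step_name]
    except KeyError:
        raise ValueError(f"Unsupported step: {step_name!r}") from None
    return step_function(mod, file)


created = dispatch("logistic-regression", {}, "logistic-step.csv")
print(created)  # ['predicted_risk']
```

A step whose key is missing or misspelled in the dictionary is never reached, which is why the key must match the step name in model-steps.csv exactly.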

Step 2: Create the Step Function

Create a new module src/model_parameters_pipeline/steps/{stepname}.py containing a function named run_step_{stepname} that implements the transformation logic.

Create the Module

Name the file after your step, using underscores in place of any hyphens so that the module name is a valid Python identifier: src/model_parameters_pipeline/steps/{stepname}.py.

Function Template

Use this template as a starting point:

"""{Step Name} transformation step.

{Brief description of what this step does and its purpose}.
"""

from __future__ import annotations

from pathlib import Path
from typing import TYPE_CHECKING

from model_parameters_pipeline._utils import verify_columns

if TYPE_CHECKING:
    from model_parameters_pipeline.pipeline import ModelPipeline


def run_step_{stepname}(mod: ModelPipeline, file: str | Path) -> list[str]:
    """Run {step name} transformation step.

    Args:
        mod: ModelPipeline instance (mutated in place).
        file: Path to {step name} step specification CSV.

    Returns:
        List of output column names created by this step.
    """
    mod._add_file(file)
    step_data = mod._get_file(file)

    verify_columns(
        step_data,
        ["column1", "column2", "column3"],
        "{stepname} step file",
        file,
    )

    output_columns: list[str] = []

    for _, row in step_data.iterrows():
        # Extract parameters from the specification
        existing_column = row["column1"]
        new_column = row["column2"]
        param = row["column3"]

        # Implement your transformation logic here, for example:
        # mod.data[new_column] = mod.data[existing_column] * param
        output_columns.append(new_column)

    return output_columns

Key Components Explained

  1. File Location: Create your step function in src/model_parameters_pipeline/steps/{stepname}.py.

  2. Function Signature:

    • Always takes mod (ModelPipeline instance) and file (path to specification file)

    • Input data is accessed and modified via mod.data (a pandas DataFrame)

    • Always returns a list[str] of output column names

    • The ModelPipeline import is under TYPE_CHECKING to avoid circular imports

  3. Load Specification File:

    mod._add_file(file)
    step_data = mod._get_file(file)
    

    Always use these ModelPipeline methods to load files – never read files directly (e.g. with pd.read_csv). _add_file and _get_file ensure that any file path is validated against the sandbox path, if one was specified when constructing the ModelPipeline. This prevents steps from reading files outside the permitted directory.

  4. Verify Columns:

    verify_columns(
        step_data,
        ["column1", "column2", "column3"],
        "{stepname} step file",
        file,
    )
    

    Validates that the specification file contains all required columns. Update the column list to match your step’s requirements from the Model Parameters documentation. Import verify_columns from model_parameters_pipeline._utils.

  5. Process Each Row: The for _, row in step_data.iterrows() loop processes each row in the specification file. Each row typically defines one transformation to apply.

  6. Access and Write Data:

    • Read a column: mod.data[column_name] (returns a pandas Series)

    • Write a column: mod.data[new_column] = transformed_values

  7. Track Output Columns: Append each new column name to output_columns so the pipeline knows which columns this step produced.
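To make the sandbox rule in item 3 concrete, here is a rough sketch of the kind of check such a loader could perform. It is not the implementation of _add_file or _get_file. The function name resolve_in_sandbox, its signature, and the example paths are all assumptions made for illustration.

```python
from __future__ import annotations

from pathlib import Path


def resolve_in_sandbox(file: str | Path, sandbox: str | Path | None) -> Path:
    """Return the resolved path, rejecting anything outside the sandbox."""
    resolved = Path(file).resolve()
    if sandbox is None:
        # No sandbox configured: every path is permitted.
        return resolved
    sandbox_root = Path(sandbox).resolve()
    # Resolving first collapses ".." segments, so "models/../secrets.csv"
    # is compared by its real location rather than its spelling.
    if not resolved.is_relative_to(sandbox_root):  # Python 3.9+
        raise ValueError(f"{resolved} is outside the sandbox {sandbox_root}")
    return resolved


sandbox = Path("models")
inside = resolve_in_sandbox(sandbox / "center.csv", sandbox)

try:
    resolve_in_sandbox(sandbox / ".." / "secrets.csv", sandbox)
    escaped = True
except ValueError:
    escaped = False

print(inside.name)  # center.csv
print(escaped)      # False
```

A step that calls pd.read_csv directly skips any such check, which is the reason for routing every file through the ModelPipeline methods.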

Example: Center Step

Here’s a real example from the existing codebase (steps/center.py):

def run_step_center(mod: ModelPipeline, file: str | Path) -> list[str]:
    """Run center transformation step.

    Args:
        mod: ModelPipeline instance (mutated in place).
        file: Path to center step specification CSV.

    Returns:
        List of output column names created by this step.
    """
    mod._add_file(file)
    step_data = mod._get_file(file)

    verify_columns(
        step_data,
        ["origVariable", "centerValue", "centeredVariable"],
        "center step file",
        file,
    )

    output_columns: list[str] = []

    for _, row in step_data.iterrows():
        orig_variable = row["origVariable"]
        center_value = row["centerValue"]
        centered_variable = row["centeredVariable"]

        mod.data[centered_variable] = mod.data[orig_variable] - center_value
        output_columns.append(centered_variable)

    return output_columns

This function:

  • Is defined in its own module steps/center.py

  • Accepts mod (a ModelPipeline instance) and file; data is accessed via mod.data

  • Loads the center specification file using mod._add_file / mod._get_file

  • Verifies it has the required columns (origVariable, centerValue, centeredVariable)

  • For each row, creates a new centered variable by subtracting centerValue from the original variable in mod.data

  • Returns a list of the new column names
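The effect of the loop can be checked on a small example without the pipeline. In this sketch a SimpleNamespace stands in for ModelPipeline (only its data attribute is modeled), the specification is built in memory instead of being loaded from a file, and the column names age and age_c are made up.

```python
from types import SimpleNamespace

import pandas as pd

# Stand-in for a ModelPipeline: only the `data` attribute is modeled here.
mod = SimpleNamespace(data=pd.DataFrame({"age": [40.0, 50.0, 60.0]}))

# Stand-in for the parsed center specification (one row per variable).
step_data = pd.DataFrame(
    {
        "origVariable": ["age"],
        "centerValue": [50.0],
        "centeredVariable": ["age_c"],
    }
)

output_columns = []
for _, row in step_data.iterrows():
    # Same arithmetic as run_step_center: subtract the center value.
    mod.data[row["centeredVariable"]] = (
        mod.data[row["origVariable"]] - row["centerValue"]
    )
    output_columns.append(row["centeredVariable"])

print(mod.data["age_c"].tolist())  # [-10.0, 0.0, 10.0]
print(output_columns)              # ['age_c']
```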

Step 3: Add Unit Tests

Unit tests ensure your transformation step works correctly. The testing framework automatically discovers and runs tests based on directory structure.

See Model Parameters Step Tests for complete instructions.
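As a rough idea of what discovery based on directory structure can look like, the sketch below collects every test-* directory that contains a test-model-steps.csv file. The function discover_step_tests is hypothetical and the project's actual test harness may work differently. Only the directory and file naming pattern is taken from this guide.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def discover_step_tests(root: Path) -> list[Path]:
    """Return test directories under root that contain a test-model-steps.csv."""
    return sorted(
        d
        for d in root.glob("test-*")
        if d.is_dir() and (d / "test-model-steps.csv").is_file()
    )


# Build a throwaway directory tree to demonstrate the lookup.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("test-center", "test-dummy"):
        (root / name).mkdir()
        (root / name / "test-model-steps.csv").write_text("")
    (root / "notes").mkdir()  # Not a test directory, so it is ignored.
    found = [d.name for d in discover_step_tests(root)]

print(found)  # ['test-center', 'test-dummy']
```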

Reference Documentation

For detailed information about Model Parameters transformation steps and their required file formats, see the Model Parameters documentation.

Checklist

Use this checklist when adding a new transformation step:

  • Create new module: src/model_parameters_pipeline/steps/{stepname}.py

  • Implement run_step_{stepname} function with type hints and docstring

  • Import and add entry to _STEP_DISPATCH in pipeline.py

  • Verify column names match the Model Parameters specification

  • Create test directory: tests/testdata/step-tests/test-{stepname}/

  • Create test-model-export.csv in test directory

  • Create test-model-steps.csv in test directory

  • Create test-{stepname}.csv with test parameters in test directory

  • Generate expected output using generate_step_tests_expected()

  • Run pytest to verify tests pass

  • Review and commit all changes including test-expected.csv

Common Patterns

Parsing Delimited Strings

Some steps store several values in one cell of their parameters file as a delimited string (e.g., “var1;var2;var3”). Use the get_string_parts helper to split such a string into its parts:

from model_parameters_pipeline._utils import get_string_parts

parts = get_string_parts(row["columnName"])

Working with Numeric Values

Convert string values to numeric when needed:

from model_parameters_pipeline._utils import get_string_parts

numeric_values = [float(v) for v in get_string_parts(row["knots"])]
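For readers without the package at hand, the splitting and conversion can be imitated with a small stand-in. The function split_parts below is not get_string_parts. It assumes a semicolon delimiter and that surrounding whitespace and empty parts are dropped, which may not match the real helper.

```python
from __future__ import annotations


def split_parts(value: str, sep: str = ";") -> list[str]:
    """Split a delimited string, dropping surrounding whitespace and empty parts."""
    return [part.strip() for part in value.split(sep) if part.strip()]


# A knots cell as it might appear in a parameters file (made-up values).
knots = [float(v) for v in split_parts("0.5; 2; 10")]
print(knots)  # [0.5, 2.0, 10.0]
```

float raises ValueError on a part that is not a number, so a malformed cell fails at conversion time instead of silently producing a wrong value.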

Creating New Columns Safely

To avoid overwriting an existing column, for example when a step needs a temporary column, generate a name that is not yet in use:

from model_parameters_pipeline._utils import get_unused_column

new_col = get_unused_column(mod.data, "prefix_")
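One common way to generate such a name is to append a counter to the prefix until the result is free. The sketch below illustrates that idea only. unused_name is a hypothetical function, and the naming scheme of the real get_unused_column may differ.

```python
from __future__ import annotations

from typing import Iterable


def unused_name(existing: Iterable[str], prefix: str) -> str:
    """Return prefix plus the smallest counter that is not already taken."""
    taken = set(existing)
    counter = 0
    while f"{prefix}{counter}" in taken:
        counter += 1
    return f"{prefix}{counter}"


# Made-up column names; a DataFrame's .columns could be passed the same way.
columns = ["age", "tmp_0", "tmp_1"]
new_col = unused_name(columns, "tmp_")
print(new_col)  # tmp_2
```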

Adding Multiple Columns

When a computation returns a 2D numpy array, assign each of its columns to a DataFrame column in a loop:

# vals is a 2D numpy array with one column per variable
vals = some_computation(mod.data[variable])
for col_idx, var_name in enumerate(variable_names):
    mod.data[var_name] = vals[:, col_idx]
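A concrete version of this pattern, with a made-up computation in place of some_computation, might look as follows. The function polynomial_terms and the column names x, x_sq, and x_cu are illustrative assumptions, and a plain DataFrame stands in for mod.data.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"x": [1.0, 2.0, 3.0]})


def polynomial_terms(values: pd.Series) -> np.ndarray:
    """Return a 2D array whose columns are the square and cube of the input."""
    x = values.to_numpy()
    return np.column_stack([x**2, x**3])


variable_names = ["x_sq", "x_cu"]
vals = polynomial_terms(data["x"])  # shape (3, 2): one column per variable

# Assign each array column to its own DataFrame column.
for col_idx, var_name in enumerate(variable_names):
    data[var_name] = vals[:, col_idx]

print(data["x_sq"].tolist())  # [1.0, 4.0, 9.0]
print(data["x_cu"].tolist())  # [1.0, 8.0, 27.0]
```

Inside a step function, each name in variable_names would also be appended to output_columns.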

Getting Help

  • For Model Parameters specification questions, refer to the Model Parameters documentation

  • For existing step implementation examples, see modules like steps/center.py, steps/dummy.py, etc.

  • For testing questions, see tests/testdata/step-tests/README.md