Adding a New Transformation Step
================================
This guide explains how to add support for a new transformation step to the
Model Parameters Pipeline, as defined by the `Model Parameters repository
`_.
Overview
--------
Adding a new transformation step requires three main tasks:
1. **Register the step** -- Add an entry to the ``_STEP_DISPATCH`` dictionary
   in ``pipeline.py``
2. **Create a new step module** -- Implement ``run_step_{stepname}`` in
   ``src/model_parameters_pipeline/steps/{stepname}.py``
3. **Add unit tests** -- Create test files to verify correct behavior
Step 1: Register the Step in ``_STEP_DISPATCH``
------------------------------------------------
The ``pipeline.py`` module contains a ``_STEP_DISPATCH`` dictionary that maps
step names to their implementation functions. The ``run`` method uses this
dictionary to dispatch each step.
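The dispatch mechanism can be sketched as follows. This is a simplified,
self-contained stand-in: the stub step functions, their signatures, and the
``run`` wrapper here are illustrative only, while the real step functions take
a ``ModelPipeline`` instance and a file path (see Step 2):

```python
from typing import Callable

# Stand-in step functions; the real ones accept a ModelPipeline instance
# and a specification-file path and return the new column names.
def run_step_center(spec: str) -> list[str]:
    return ["age_centered"]

def run_step_dummy(spec: str) -> list[str]:
    return ["sex_male"]

# The dispatch table maps step names (as they appear in model-steps.csv)
# to their implementation functions.
_STEP_DISPATCH: dict[str, Callable[[str], list[str]]] = {
    "center": run_step_center,
    "dummy": run_step_dummy,
}

def run(step_name: str, spec: str) -> list[str]:
    # Look up the step by name and call its implementation.
    if step_name not in _STEP_DISPATCH:
        raise ValueError(f"Unknown step: {step_name}")
    return _STEP_DISPATCH[step_name](spec)

cols = run("center", "center.csv")
```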
Location
^^^^^^^^
Find the ``_STEP_DISPATCH`` dictionary near the top of ``pipeline.py``:
.. code-block:: python

    _STEP_DISPATCH: dict[str, Callable[[ModelPipeline, str | Path], list[str]]] = {
        "center": run_step_center,
        "dummy": run_step_dummy,
        "interaction": run_step_interaction,
        "logistic-regression": run_step_logistic_regression,
        "rcs": run_step_rcs,
    }
Add Your Step
^^^^^^^^^^^^^
1. Import your new step function at the top of the file:

   .. code-block:: python

      from model_parameters_pipeline.steps.your_step_name import run_step_your_step_name
2. Add a new entry to ``_STEP_DISPATCH``:

   .. code-block:: python

      _STEP_DISPATCH: dict[str, Callable[[ModelPipeline, str | Path], list[str]]] = {
          "center": run_step_center,
          "dummy": run_step_dummy,
          "interaction": run_step_interaction,
          "logistic-regression": run_step_logistic_regression,
          "rcs": run_step_rcs,
          "your-step-name": run_step_your_step_name,
      }
.. important::

   - The dictionary key (``"your-step-name"``) must match the exact step name
     as it appears in the Model Parameters specification and in users'
     ``model-steps.csv`` files.
   - The function name (``run_step_your_step_name``) uses underscores instead
     of hyphens.
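The naming convention can be expressed in one line (a trivial illustration,
using a hypothetical step name):

```python
step_name = "your-step-name"  # dictionary key, as written in model-steps.csv
# Function names replace hyphens with underscores.
function_name = "run_step_" + step_name.replace("-", "_")
```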
Step 2: Create the Step Function
---------------------------------
Create a new module ``src/model_parameters_pipeline/steps/{stepname}.py``
containing a function named ``run_step_{stepname}`` that implements the
transformation logic.
Create the Module
^^^^^^^^^^^^^^^^^
Create a new file at ``src/model_parameters_pipeline/steps/{stepname}.py``
(replace ``{stepname}`` with your step name, using underscores).
Function Template
^^^^^^^^^^^^^^^^^
Use this template as a starting point:
.. code-block:: python

    """{Step Name} transformation step.

    {Brief description of what this step does and its purpose}.
    """

    from __future__ import annotations

    from pathlib import Path
    from typing import TYPE_CHECKING

    from model_parameters_pipeline._utils import verify_columns

    if TYPE_CHECKING:
        from model_parameters_pipeline.pipeline import ModelPipeline


    def run_step_{stepname}(mod: ModelPipeline, file: str | Path) -> list[str]:
        """Run {step name} transformation step.

        Args:
            mod: ModelPipeline instance (mutated in place).
            file: Path to {step name} step specification CSV.

        Returns:
            List of output column names created by this step.
        """
        mod._add_file(file)
        step_data = mod._get_file(file)
        verify_columns(
            step_data,
            ["column1", "column2", "column3"],
            "{stepname} step file",
            file,
        )
        output_columns: list[str] = []
        for _, row in step_data.iterrows():
            # Extract parameters from the specification
            param1 = row["column1"]
            param2 = row["column2"]
            new_column = row["column3"]
            # Implement your transformation logic here, e.g.:
            # mod.data[new_column] = mod.data[param1] * param2
            output_columns.append(new_column)
        return output_columns
Key Components Explained
^^^^^^^^^^^^^^^^^^^^^^^^
1. **File Location**: Create your step function in
   ``src/model_parameters_pipeline/steps/{stepname}.py``.
2. **Function Signature**:

   - Always takes ``mod`` (``ModelPipeline`` instance) and ``file`` (path to
     specification file)
   - Input data is accessed and modified via ``mod.data`` (a pandas DataFrame)
   - Always returns a ``list[str]`` of output column names
   - The ``ModelPipeline`` import is under ``TYPE_CHECKING`` to avoid circular
     imports
3. **Load Specification File**:

   .. code-block:: python

      mod._add_file(file)
      step_data = mod._get_file(file)

   Always use these ``ModelPipeline`` methods to load files -- never read
   files directly (e.g. with ``pd.read_csv``). ``_add_file`` and ``_get_file``
   ensure that any file path is validated against the sandbox path, if one was
   specified when constructing the ``ModelPipeline``. This prevents steps from
   reading files outside the permitted directory.
4. **Verify Columns**:

   .. code-block:: python

      verify_columns(
          step_data,
          ["column1", "column2", "column3"],
          "{stepname} step file",
          file,
      )

   Validates that the specification file contains all required columns. Update
   the column list to match your step's requirements from the Model Parameters
   documentation. Import ``verify_columns`` from
   ``model_parameters_pipeline._utils``.
5. **Process Each Row**: The ``for _, row in step_data.iterrows()`` loop
   processes each row in the specification file. Each row typically defines
   one transformation to apply.
6. **Access and Write Data**:

   - Read a column: ``mod.data[column_name]`` (returns a pandas Series)
   - Write a column: ``mod.data[new_column] = transformed_values``

7. **Track Output Columns**: Append each new column name to
   ``output_columns`` so the pipeline knows which columns this step produced.
Example: Center Step
^^^^^^^^^^^^^^^^^^^^
Here's a real example from the existing codebase (``steps/center.py``):
.. code-block:: python

    def run_step_center(mod: ModelPipeline, file: str | Path) -> list[str]:
        """Run center transformation step.

        Args:
            mod: ModelPipeline instance (mutated in place).
            file: Path to center step specification CSV.

        Returns:
            List of output column names created by this step.
        """
        mod._add_file(file)
        step_data = mod._get_file(file)
        verify_columns(
            step_data,
            ["origVariable", "centerValue", "centeredVariable"],
            "center step file",
            file,
        )
        output_columns: list[str] = []
        for _, row in step_data.iterrows():
            orig_variable = row["origVariable"]
            center_value = row["centerValue"]
            centered_variable = row["centeredVariable"]
            mod.data[centered_variable] = mod.data[orig_variable] - center_value
            output_columns.append(centered_variable)
        return output_columns
This function:

- Is defined in its own module ``steps/center.py``
- Accepts ``mod`` (a ``ModelPipeline`` instance) and ``file``; data is
  accessed via ``mod.data``
- Loads the center specification file using ``mod._add_file`` /
  ``mod._get_file``
- Verifies it has the required columns (``origVariable``, ``centerValue``,
  ``centeredVariable``)
- For each row, creates a new centered variable by subtracting
  ``centerValue`` from the original variable in ``mod.data``
- Returns a list of the new column names
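To see the center step end to end, here is a self-contained sketch with a
minimal stand-in for ``ModelPipeline``. The stub class and the inlined step
body (column verification omitted for brevity) are illustrative only; the real
class also handles file loading and sandbox validation:

```python
import pandas as pd

class StubPipeline:
    """Minimal stand-in for ModelPipeline, for illustration only."""
    def __init__(self, data: pd.DataFrame, spec: pd.DataFrame):
        self.data = data
        self._spec = spec
    def _add_file(self, file):
        pass  # the real version validates the path against the sandbox
    def _get_file(self, file) -> pd.DataFrame:
        return self._spec

def run_step_center(mod, file) -> list[str]:
    # Same logic as steps/center.py, minus verify_columns.
    mod._add_file(file)
    step_data = mod._get_file(file)
    output_columns = []
    for _, row in step_data.iterrows():
        mod.data[row["centeredVariable"]] = (
            mod.data[row["origVariable"]] - row["centerValue"]
        )
        output_columns.append(row["centeredVariable"])
    return output_columns

mod = StubPipeline(
    data=pd.DataFrame({"age": [30, 50, 70]}),
    spec=pd.DataFrame({
        "origVariable": ["age"],
        "centerValue": [50],
        "centeredVariable": ["age_centered"],
    }),
)
new_cols = run_step_center(mod, "center.csv")
```

After running, ``mod.data`` contains the new ``age_centered`` column with
values ``-20, 0, 20``.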
Step 3: Add Unit Tests
-----------------------
Unit tests ensure your transformation step works correctly. The testing
framework automatically discovers and runs tests based on directory structure.
See :doc:`step-tests` for complete instructions.
Reference Documentation
-----------------------
For detailed information about Model Parameters transformation steps and their
required file formats, see:
- `Model Parameters Reference Documentation
  `_
- :doc:`step-tests`
Checklist
---------
Use this checklist when adding a new transformation step:
- Create new module: ``src/model_parameters_pipeline/steps/{stepname}.py``
- Implement ``run_step_{stepname}`` function with type hints and docstring
- Import and add entry to ``_STEP_DISPATCH`` in ``pipeline.py``
- Verify column names match the Model Parameters specification
- Create test directory: ``tests/testdata/step-tests/test-{stepname}/``
- Create ``test-model-export.csv`` in test directory
- Create ``test-model-steps.csv`` in test directory
- Create ``test-{stepname}.csv`` with test parameters in test directory
- Generate expected output using ``generate_step_tests_expected()``
- Run ``pytest`` to verify tests pass
- Review and commit all changes including ``test-expected.csv``
Common Patterns
---------------
Parsing Delimited Strings
^^^^^^^^^^^^^^^^^^^^^^^^^^
Some steps use delimited strings (e.g., "var1;var2;var3") in their parameters
file. Use the helper function:
.. code-block:: python

    from model_parameters_pipeline._utils import get_string_parts

    parts = get_string_parts(row["columnName"])
Working with Numeric Values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Convert string values to numeric when needed:
.. code-block:: python

    from model_parameters_pipeline._utils import get_string_parts

    numeric_values = [float(v) for v in get_string_parts(row["knots"])]
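Assuming ``get_string_parts`` splits on semicolons (an assumption -- check
``_utils.py`` for the actual delimiter handling), the two snippets above
behave roughly like this self-contained sketch, with ``split_parts`` as a
stand-in for the helper:

```python
def split_parts(value: str) -> list[str]:
    # Stand-in for get_string_parts: split on ";" and strip whitespace.
    return [part.strip() for part in value.split(";")]

parts = split_parts("var1;var2;var3")
knots = [float(v) for v in split_parts("10; 50; 90")]
```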
Creating New Columns Safely
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To avoid column name conflicts:
.. code-block:: python

    from model_parameters_pipeline._utils import get_unused_column

    new_col = get_unused_column(mod.data, "prefix_")
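One plausible way such a helper could work (a hypothetical sketch, not the
actual ``get_unused_column`` implementation) is to append an increasing
numeric suffix until the name no longer collides:

```python
import pandas as pd

def unused_column(df: pd.DataFrame, prefix: str) -> str:
    # Try prefix_0, prefix_1, ... until a name is free.
    i = 0
    while f"{prefix}{i}" in df.columns:
        i += 1
    return f"{prefix}{i}"

df = pd.DataFrame({"prefix_0": [1], "prefix_1": [2]})
new_col = unused_column(df, "prefix_")
```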
Adding Multiple Columns
^^^^^^^^^^^^^^^^^^^^^^^^
You can add several columns derived from a single 2D numpy array:

.. code-block:: python

    import numpy as np

    # vals is a 2D numpy array with one column per output variable
    vals = some_computation(mod.data[variable])
    for col_idx, var_name in enumerate(variable_names):
        mod.data[var_name] = vals[:, col_idx]
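As a concrete, runnable variant of the pattern above (the column names and
the squared/cubed computation are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

# Build a 2D array with one column per derived variable.
vals = np.column_stack([data["x"] ** 2, data["x"] ** 3])
variable_names = ["x_squared", "x_cubed"]
for col_idx, var_name in enumerate(variable_names):
    data[var_name] = vals[:, col_idx]
```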
Getting Help
------------
- For Model Parameters specification questions, refer to the `Model Parameters
  documentation `_
- For existing step implementation examples, see modules like
  ``steps/center.py``, ``steps/dummy.py``, etc.
- For testing questions, see ``tests/testdata/step-tests/README.md``