Adding a New Transformation Step
================================

This guide explains how to add support for a new transformation step to the Model Parameters Pipeline, as defined by the `Model Parameters repository `_.

Overview
--------

Adding a new transformation step requires three main tasks:

1. **Register the step** in the ``_STEP_DISPATCH`` dictionary in ``pipeline.py``
2. **Create a new step module** -- Implement ``run_step_{stepname}`` in ``src/model_parameters_pipeline/steps/{stepname}.py``
3. **Add unit tests** -- Create test files to verify correct behavior

Step 1: Register the Step in ``_STEP_DISPATCH``
------------------------------------------------

The ``pipeline.py`` module contains a ``_STEP_DISPATCH`` dictionary that maps step names to their implementation functions. The ``run`` method uses this dictionary to dispatch each step.

Location
^^^^^^^^

Find the ``_STEP_DISPATCH`` dictionary near the top of ``pipeline.py``:

.. code-block:: python

   _STEP_DISPATCH: dict[str, Callable[[ModelPipeline, str | Path], list[str]]] = {
       "center": run_step_center,
       "dummy": run_step_dummy,
       "interaction": run_step_interaction,
       "logistic-regression": run_step_logistic_regression,
       "rcs": run_step_rcs,
   }

Add Your Step
^^^^^^^^^^^^^

1. Import your new step function at the top of the file:

   .. code-block:: python

      from model_parameters_pipeline.steps.your_step_name import run_step_your_step_name

2. Add a new entry to ``_STEP_DISPATCH``:

   .. code-block:: python

      _STEP_DISPATCH: dict[str, Callable[[ModelPipeline, str | Path], list[str]]] = {
          "center": run_step_center,
          "dummy": run_step_dummy,
          "interaction": run_step_interaction,
          "logistic-regression": run_step_logistic_regression,
          "rcs": run_step_rcs,
          "your-step-name": run_step_your_step_name,
      }

.. important::

   - The dictionary key (``"your-step-name"``) must match the exact step name as it appears in the Model Parameters specification and in users' ``model-steps.csv`` files.
   - The function name (``run_step_your_step_name``) uses underscores instead of hyphens.

Step 2: Create the Step Function
---------------------------------

Create a new module ``src/model_parameters_pipeline/steps/{stepname}.py`` containing a function named ``run_step_{stepname}`` that implements the transformation logic.

Create the Module
^^^^^^^^^^^^^^^^^

Create a new file at ``src/model_parameters_pipeline/steps/{stepname}.py`` (replace ``{stepname}`` with your step name, using underscores).

Function Template
^^^^^^^^^^^^^^^^^

Use this template as a starting point:

.. code-block:: python

   """{Step Name} transformation step.

   {Brief description of what this step does and its purpose}.
   """

   from __future__ import annotations

   from pathlib import Path
   from typing import TYPE_CHECKING

   from model_parameters_pipeline._utils import verify_columns

   if TYPE_CHECKING:
       from model_parameters_pipeline.pipeline import ModelPipeline


   def run_step_{stepname}(mod: ModelPipeline, file: str | Path) -> list[str]:
       """Run {step name} transformation step.

       Args:
           mod: ModelPipeline instance (mutated in place).
           file: Path to {step name} step specification CSV.

       Returns:
           List of output column names created by this step.
       """
       mod._add_file(file)
       step_data = mod._get_file(file)

       verify_columns(
           step_data,
           ["column1", "column2", "column3"],
           "{stepname} step file",
           file,
       )

       output_columns: list[str] = []

       for _, row in step_data.iterrows():
           # Extract parameters from the specification
           param1 = row["column1"]
           param2 = row["column2"]
           param3 = row["column3"]

           # Implement your transformation logic here
           # Example: mod.data[new_column] = mod.data[existing_column] * param

           output_columns.append(new_column)

       return output_columns

Key Components Explained
^^^^^^^^^^^^^^^^^^^^^^^^

1. **File Location**: Create your step function in ``src/model_parameters_pipeline/steps/{stepname}.py``.

2. **Function Signature**:

   - Always takes ``mod`` (``ModelPipeline`` instance) and ``file`` (path to specification file)
   - Input data is accessed and modified via ``mod.data`` (a pandas DataFrame)
   - Always returns a ``list[str]`` of output column names
   - The ``ModelPipeline`` import is under ``TYPE_CHECKING`` to avoid circular imports

3. **Load Specification File**:

   .. code-block:: python

      mod._add_file(file)
      step_data = mod._get_file(file)

   Always use these ``ModelPipeline`` methods to load files -- never read files directly (e.g. with ``pd.read_csv``). ``_add_file`` and ``_get_file`` ensure that any file path is validated against the sandbox path, if one was specified when constructing the ``ModelPipeline``. This prevents steps from reading files outside the permitted directory.

4. **Verify Columns**:

   .. code-block:: python

      verify_columns(
          step_data,
          ["column1", "column2", "column3"],
          "{stepname} step file",
          file,
      )

   Validates that the specification file contains all required columns. Update the column list to match your step's requirements from the Model Parameters documentation. Import ``verify_columns`` from ``model_parameters_pipeline._utils``.

5. **Process Each Row**: The ``for _, row in step_data.iterrows()`` loop processes each row in the specification file. Each row typically defines one transformation to apply.

6. **Access and Write Data**:

   - Read a column: ``mod.data[column_name]`` (returns a pandas Series)
   - Write a column: ``mod.data[new_column] = transformed_values``

7. **Track Output Columns**: Append each new column name to ``output_columns`` so the pipeline knows which columns this step produced.

Example: Center Step
^^^^^^^^^^^^^^^^^^^^

Here's a real example from the existing codebase (``steps/center.py``):

.. code-block:: python

   def run_step_center(mod: ModelPipeline, file: str | Path) -> list[str]:
       """Run center transformation step.

       Args:
           mod: ModelPipeline instance (mutated in place).
           file: Path to center step specification CSV.

       Returns:
           List of output column names created by this step.
       """
       mod._add_file(file)
       step_data = mod._get_file(file)

       verify_columns(
           step_data,
           ["origVariable", "centerValue", "centeredVariable"],
           "center step file",
           file,
       )

       output_columns: list[str] = []

       for _, row in step_data.iterrows():
           orig_variable = row["origVariable"]
           center_value = row["centerValue"]
           centered_variable = row["centeredVariable"]

           mod.data[centered_variable] = mod.data[orig_variable] - center_value
           output_columns.append(centered_variable)

       return output_columns

This function:

- Is defined in its own module ``steps/center.py``
- Accepts ``mod`` (a ``ModelPipeline`` instance) and ``file``; data is accessed via ``mod.data``
- Loads the center specification file using ``mod._add_file`` / ``mod._get_file``
- Verifies it has the required columns (``origVariable``, ``centerValue``, ``centeredVariable``)
- For each row, creates a new centered variable by subtracting ``centerValue`` from the original variable in ``mod.data``
- Returns a list of the new column names

Step 3: Add Unit Tests
-----------------------

Unit tests ensure your transformation step works correctly. The testing framework automatically discovers and runs tests based on directory structure.

See :doc:`step-tests` for complete instructions.
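
Before building the fixture-based tests, it can help to exercise the row-by-row logic of a step against a small stand-in object. The sketch below is illustrative only: ``StubPipeline`` and ``center_like_step`` are hypothetical names that do not exist in the package, the stub serves specification tables from memory instead of validating paths against a sandbox, and the column check done by ``verify_columns`` is omitted. It mirrors the centering logic of the center step so the effect on ``mod.data`` is easy to see.

.. code-block:: python

   import pandas as pd


   class StubPipeline:
       """Hypothetical stand-in for ModelPipeline: only what a step touches."""

       def __init__(self, data, spec_files):
           self.data = data
           # Assumption: specification tables are held in memory, keyed by name.
           self._files = spec_files

       def _add_file(self, file):
           # The real method validates the path; the stub only checks the key.
           if file not in self._files:
               raise FileNotFoundError(file)

       def _get_file(self, file):
           return self._files[file]


   def center_like_step(mod, file):
       """Same loop shape as the center step, without the package imports."""
       mod._add_file(file)
       step_data = mod._get_file(file)

       output_columns = []
       for _, row in step_data.iterrows():
           new_name = row["centeredVariable"]
           mod.data[new_name] = mod.data[row["origVariable"]] - row["centerValue"]
           output_columns.append(new_name)
       return output_columns


   spec = pd.DataFrame(
       {
           "origVariable": ["age"],
           "centerValue": [50.0],
           "centeredVariable": ["age_c"],
       }
   )
   mod = StubPipeline(pd.DataFrame({"age": [40.0, 50.0, 65.0]}), {"center.csv": spec})
   created = center_like_step(mod, "center.csv")

After the call, ``created`` holds the single new column name and ``mod.data`` has gained an ``age_c`` column alongside the original ``age`` column. A check like this complements the fixture-based tests; it does not replace them.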

Reference Documentation
-----------------------

For detailed information about Model Parameters transformation steps and their required file formats, see:

- `Model Parameters Reference Documentation `_
- :doc:`step-tests`

Checklist
---------

Use this checklist when adding a new transformation step:

- Create new module: ``src/model_parameters_pipeline/steps/{stepname}.py``
- Implement ``run_step_{stepname}`` function with type hints and docstring
- Import and add entry to ``_STEP_DISPATCH`` in ``pipeline.py``
- Verify column names match the Model Parameters specification
- Create test directory: ``tests/testdata/step-tests/test-{stepname}/``
- Create ``test-model-export.csv`` in test directory
- Create ``test-model-steps.csv`` in test directory
- Create ``test-{stepname}.csv`` with test parameters in test directory
- Generate expected output using ``generate_step_tests_expected()``
- Run ``pytest`` to verify tests pass
- Review and commit all changes including ``test-expected.csv``

Common Patterns
---------------

Parsing Delimited Strings
^^^^^^^^^^^^^^^^^^^^^^^^^^

Some steps use delimited strings (e.g., "var1;var2;var3") in their parameters file. Use the helper function:

.. code-block:: python

   from model_parameters_pipeline._utils import get_string_parts

   parts = get_string_parts(row["columnName"])

Working with Numeric Values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Convert string values to numeric when needed:

.. code-block:: python

   from model_parameters_pipeline._utils import get_string_parts

   numeric_values = [float(v) for v in get_string_parts(row["knots"])]

Creating New Columns Safely
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To avoid column name conflicts:

.. code-block:: python

   from model_parameters_pipeline._utils import get_unused_column

   new_col = get_unused_column(mod.data, "prefix_")

Adding Multiple Columns
^^^^^^^^^^^^^^^^^^^^^^^^

When one computation yields several output columns as a 2D numpy array, assign each array column to its own variable name:

.. code-block:: python

   import numpy as np

   # vals is a 2D numpy array with one column per variable
   # (some_computation is a placeholder for your step's logic)
   vals = some_computation(mod.data[variable])
   for col_idx, var_name in enumerate(variable_names):
       mod.data[var_name] = vals[:, col_idx]

Getting Help
------------

- For Model Parameters specification questions, refer to the `Model Parameters documentation `_
- For existing step implementation examples, see modules like ``steps/center.py``, ``steps/dummy.py``, etc.
- For testing questions, see ``tests/testdata/step-tests/README.md``
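
To make the delimited-string and numeric-conversion patterns under Common Patterns more concrete, here is a minimal, self-contained sketch. The helpers ``split_parts`` and ``parse_knots`` are hypothetical and are not the package's ``get_string_parts``; the sketch assumes a semicolon delimiter (as in the "var1;var2;var3" example) and assumes surrounding whitespace should be ignored. The real helper may behave differently. The main idea is to report the offending token when a value is not numeric, so that a malformed specification row is easy to locate.

.. code-block:: python

   def split_parts(value: str, sep: str = ";") -> list[str]:
       """Split a delimited cell into trimmed, non-empty parts (hypothetical)."""
       return [part.strip() for part in value.split(sep) if part.strip()]


   def parse_knots(value: str) -> list[float]:
       """Convert a delimited cell to floats, naming the bad token on failure."""
       knots = []
       for part in split_parts(value):
           try:
               knots.append(float(part))
           except ValueError as exc:
               raise ValueError(f"non-numeric knot value: {part!r}") from exc
       return knots


   variables = split_parts("var1;var2;var3")
   knots = parse_knots("5; 27.5 ;50")

Inside a step function, the same conversion would normally go through ``get_string_parts`` as shown under Working with Numeric Values; the explicit ``try``/``except`` is one possible way to surface a clearer error message.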