Configuring the Preprocessor

Backend Configuration

The base BackendConfig contains fields that are leveraged by internal functions for processing input data prior to passing inputs to the transform_inputs() method. The base config arguments are initialized at the Backend’s base_config_model class attribute, and they can be accessed via the config instance attribute in the custom Inference Backend.

The following fields are used for default behaviors of the Base Config Model:

  • verbose: A boolean indicating whether to output verbose logs. Defaults to True.

  • input_format: A string specifying the preprocessor; one of 'passthrough', 'records', or 'numpy'. For details, see Preprocessors.

  • rename_fields: A dictionary mapping of {"old_name": "new_name"} which will be renamed during 'records' or 'numpy' preprocessing.

  • feature_names: A list of feature names. If non-empty, acts as a preprocessing filter. Behavior varies between 'records' and 'numpy' preprocessors. Defaults to an empty list.

  • flatten_nested_inputs: A boolean indicating whether to flatten nested inputs. Defaults to False.

  • flatten_lists: A boolean indicating whether to also flatten lists when flattening nested inputs. Defaults to False.

  • nested_field_delimiter: A string indicating the delimiter for nested fields. Defaults to a period (‘.’).

Warning

When flatten_nested_inputs is False, input keys containing nested_field_delimiter may result in incorrect nested structures or key collisions. For best results, ensure delimiters do not appear in record keys.

Example

Consider the following preprocessor configuration:

/path/to/a/config.json
 1{
 2    "configs": {
 3        "flatten_nested_inputs": true,
 4        "nested_field_delimiter": ":",
 5        "rename_fields": {
 6            "foo:bar": "feature_1",
 7            "fizz:buzz": "feature_2"
 8        },
 9        "feature_names": [
10            "feature_1",
11            "feature_2"
12        ]
13    }
14}

and this InferenceBackend implementation that simply prints the data at each stage:

inference.py
 1from packflow import InferenceBackend
 2
 3
 4class PrintBackend(InferenceBackend):
 5    def __call__(self, *args, **kwargs):
 6        # Normally, __call__ should not be overridden in a backend.
 7        # This is just to show the inputs before the internal preprocessors run.
 8        print(f"PrintBackend called with args: {args}, kwargs: {kwargs}")
 9        return super().__call__(*args, **kwargs)
10
11    def transform_inputs(self, inputs):
12        print(f"Transform Inputs received: {inputs}")
13        return inputs
14
15    def execute(self, inputs):
16        print(f"Execute received: {inputs}")
17        # Config fields are available in the backend through self.config
18        # For demonstration, this will just encapsulate each record in a dict with a "result" key.
19        return [{"result": record} for record in inputs]
20
21
22if __name__ == "__main__":
23    backend = PrintBackend()
24    sample_input = {"foo": {"bar": 0, "baz": [1]}, "fizz": {"buzz": 2}}
25    print("Final Output:", backend(sample_input))

When the above backend is executed without configuration, the input from the __main__ block {"foo": {"bar": 0, "baz": [1]}, "fizz": {"buzz": 2}} will pass through unchanged:

$ python inference.py
PrintBackend called with args: ({'foo': {'bar': 0, 'baz': [1]}, 'fizz': {'buzz': 2}},), kwargs: {}
Transform Inputs received: [{'foo': {'bar': 0, 'baz': [1]}, 'fizz': {'buzz': 2}}]
Execute received: [{'foo': {'bar': 0, 'baz': [1]}, 'fizz': {'buzz': 2}}]
Final Output: {'result': {'foo': {'bar': 0, 'baz': [1]}, 'fizz': {'buzz': 2}}}

However, when the backend is executed with the above config, the input will be transformed according to the preprocessor configuration:

$ BACKEND_CONFIG_FILE_PATH=/path/to/a/config.json python inference.py
PrintBackend called with args: ({'foo': {'bar': 0, 'baz': [1]}, 'fizz': {'buzz': 2}},), kwargs: {}
Transform Inputs received: [{'feature_1': 0, 'feature_2': 2}]
Execute received: [{'feature_1': 0, 'feature_2': 2}]
Final Output: {'result': {'feature_1': 0, 'feature_2': 2}}

The input record has been flattened, filtered to only include the specified feature names, and renamed according to the config file. This demonstrates how the preprocessor configuration fields can be used to manipulate input data before it reaches the core logic of the InferenceBackend, allowing for an InferenceBackend to be reused across different data schemas with minimal code changes.

Please see Config Hierarchy for more details on how configurations are loaded and overridden.

Preprocessors

Preprocessing occurs in tandem to the Inference Backend framework to assist with streamlining development. Each preprocessor relies on different Backend Configuration fields. The following subsections outline the required fields and expected behaviors for each preprocessor.

Passthrough Preprocessor

Condition: Used when input_format="passthrough".

Expected Behaviors:

  • Input data is untouched and passed directly to the transform_inputs() method.

Records Preprocessor [Default]

Condition: Used when input_format="records".

Expected Behaviors:

  • If rename_fields has a value:
    • The fields in incoming records will be renamed based on this mapping, prior to any other preprocessing steps.

  • If feature_names is not empty:
    • The records will be filtered and sorted based on the values in this array.

    • This will drastically lower the size of the data passing through the pipeline, which can lead to performance boosts.

  • If flatten_nested_inputs is True:
    • Nested events (e.g., {"foo": {"bar": 0}}) will be ‘flattened’ to {"foo.bar": 0}.

    • The value of the nested_field_delimiter config will determine the delimiter for the flattened fields (default: ‘.’).

    • If flatten_lists is True:
      • The flattening will also include lists.

      • Example: {"foo": {"bar": [0, 1]}} will be flattened to {"foo.bar.0": 0, "foo.bar.1": 1}

Nested Path Access:

When using delimiter notation in rename_fields or feature_names to access nested paths (e.g., "foo.bar" to access {"foo": {"bar": value}}), the preprocessor will traverse the nested structure directly without flattening. If a key in the path does not exist, it will be silently skipped and not included in the output.

Important: When flatten_nested_inputs=False, the preprocessor never flattens the input data. Keys containing the delimiter character are preserved exactly as they appear. Nested path access is performed via direct traversal of the dictionary structure.

Example:

config = BackendConfig(
    flatten_nested_inputs=False,
    nested_field_delimiter=".",
    feature_names=["a.b", "x"]
)

# Input with nested structure
input_data = [{"a": {"b": 10}, "x": 5}]

# Output preserves nested structure for accessed paths
# Result: [{"a": {"b": 10}, "x": 5}]

Delimiter Collision Detection:

When using nested path access (delimiter notation in rename_fields or feature_names) with flatten_nested_inputs=False, the preprocessor will check for keys that contain the delimiter character. This prevents ambiguity between literal keys and nested paths.

config = BackendConfig(
    flatten_nested_inputs=False,
    nested_field_delimiter=".",
    feature_names=["a.b"]  # Nested path access
)

# Input has a key containing the delimiter - this will raise an error
input_data = [{"field.name": 42, "a": {"b": 1}}]
# PreprocessorRuntimeError: Keys containing the delimiter '.' were found: ['field.name']

To bypass this check (not recommended), set ignore_delimiter_collisions=True:

config = BackendConfig(
    flatten_nested_inputs=False,
    nested_field_delimiter=".",
    feature_names=["a.b"],
    ignore_delimiter_collisions=True  # Bypass collision detection
)

Warning

Setting ignore_delimiter_collisions=True may result in undefined behavior if key collisions occur. Use with caution and ensure your data does not contain keys with the delimiter character when using nested path access.

Note

The default BackendConfig values will not trigger any of the above conditions and will fall back to acting as a Passthrough preprocessor for optimization purposes.

Numpy Preprocessor

Condition: Used when input_format="numpy".

Expected Behaviors:

  • If rename_fields has a value:
    • The fields in incoming records will be renamed based on this mapping, prior to conversion to an ndarray.

    • Especially helpful if input names to not match required feature names.

  • Creates a loosely-typed ndarray based on the contents of the feature_names config.
    • Example: If inputs=[{"foo": 0}, {"foo": 1}] and feature_names=["foo"], the data passed to transform_inputs() would be equivalent to numpy.array([[0], [1]]).