{ "cells": [ { "cell_type": "markdown", "id": "b784c0c7-7729-408a-a68f-a2b407ce03e1", "metadata": {}, "source": [ "# Scikit-Learn Classifier" ] }, { "cell_type": "markdown", "id": "b09ed980-8367-4f2e-a1ba-cc255a6797dd", "metadata": {}, "source": [ "This Notebook is designed to be an example for developing a modular, reusable Scikit-Learn classification backend. \n", "In this guide we will:\n", "\n", "1. Creating a project with the Poetry\n", "2. Train a classifier with Scikit-Learn\n", "3. Develop the Inference Backend for running the model with Packflow\n", "4. Load and validate the Backend from the installed package\n", "\n", "## Creating a Project\n", "\n", "First, We'll install poetry and create a new Project:" ] }, { "cell_type": "code", "execution_count": 1, "id": "1eb9b828-b480-4c50-86b6-6e434fd08cad", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install poetry --quiet" ] }, { "cell_type": "code", "execution_count": 2, "id": "2d09201e-ced1-4eee-a680-39cf483990f7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created package sklearn_classifier in sklearn_classifier\n" ] } ], "source": [ "%%sh\n", "\n", "poetry new sklearn_classifier" ] }, { "cell_type": "markdown", "id": "d2cf8853-9d9b-4d4a-a2cc-669b8360e0dc", "metadata": {}, "source": [ "Next, we need to install a few dependencies to our poetry project:" ] }, { "cell_type": "code", "execution_count": 3, "id": "87ae0685-890b-4f05-8ae9-181d88e2c791", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using version ^1.8.0 for scikit-learn\n", "Using version ^1.5.3 for joblib\n", "Using version ^3.0.0 for pandas\n", "\n", "Updating dependencies\n", "Resolving dependencies...\n", "\n", "No dependencies to install or update\n", "\n", "Writing lock file\n" ] } ], "source": [ "%%sh\n", "\n", "poetry --directory 
./sklearn_classifier add scikit-learn joblib pandas" ] }, { "cell_type": "markdown", "id": "007e6aa3-0141-45b0-92f4-6689bc7d05ab", "metadata": {}, "source": [ "## Training a Iris Classifier\n", "\n", "For our sample use-case, we'll use the Scikit-Learn Iris dataset and train a simple Decision Tree Classifier:" ] }, { "cell_type": "code", "execution_count": 4, "id": "bc223ca1-2fc7-4d29-8b5f-1dcba964e9bc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
215.13.71.50.4
294.73.21.60.2
1116.42.75.31.9
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "21 5.1 3.7 1.5 0.4\n", "29 4.7 3.2 1.6 0.2\n", "111 6.4 2.7 5.3 1.9" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.datasets import load_iris\n", "\n", "X, y = load_iris(return_X_y=True, as_frame=True)\n", "\n", "X.sample(3)" ] }, { "cell_type": "markdown", "id": "87289016-2f9f-439f-ab47-267aa9068c3e", "metadata": {}, "source": [ "For simplicity, we will ignore best practices and train our model on the entire dataset:" ] }, { "cell_type": "code", "execution_count": 5, "id": "e43682e3-a7f1-4f1c-9fab-7c3972b150e5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sklearn_classifier/src/sklearn_classifier/model.joblib']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "import joblib\n", "\n", "model = DecisionTreeClassifier()\n", "\n", "model.fit(X, y)\n", "\n", "joblib.dump(model, \"sklearn_classifier/src/sklearn_classifier/model.joblib\")" ] }, { "cell_type": "markdown", "id": "ba2cc9e3-7a1b-45b6-b76f-36ea116afa35", "metadata": {}, "source": [ "The model has now been trained (fit) and serialized with joblib to the path output above.\n", "\n", "## Developing the Inference Backend\n", "\n", "Now we can develop the Inference Backend for running and sharing the model with Packflow:" ] }, { "cell_type": "code", "execution_count": 6, "id": "2238de9b-e710-488e-8a84-14e9a749cbc9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing sklearn_classifier/src/sklearn_classifier/inference.py\n" ] } ], "source": [ "%%writefile sklearn_classifier/src/sklearn_classifier/inference.py\n", "\n", "# -- Packflow imports --\n", "from packflow import InferenceBackend, BackendConfig\n", "from packflow.utils.normalize import ensure_valid_output\n", "\n", "# -- Imports that are required to 
run the model --\n",
"from pathlib import Path\n",
"import pandas as pd\n",
"import sklearn\n",
"import joblib\n",
"\n",
"\n",
"class SklearnClassifierConfig(BackendConfig):\n",
"    # Create a config field for where to load the model from\n",
"    serialized_model_path: Path = Path(__file__).resolve().parent.joinpath('model.joblib')\n",
"\n",
"    # Define the default input feature names\n",
"    feature_names: list[str] = [\n",
"        'sepal length (cm)',\n",
"        'sepal width (cm)',\n",
"        'petal length (cm)',\n",
"        'petal width (cm)'\n",
"    ]\n",
"\n",
"\n",
"class Backend(InferenceBackend):\n",
"    # Override the default config model with the custom config defined above\n",
"    backend_config_model = SklearnClassifierConfig\n",
"\n",
"    def initialize(self):\n",
"        self.logger.info(f'Loading model from: {self.config.serialized_model_path}')\n",
"        self.model = joblib.load(self.config.serialized_model_path)\n",
"\n",
"    def transform_inputs(self, inputs):\n",
"        \"\"\"\n",
"        Convert the input array (this backend uses the Numpy Preprocessor)\n",
"        to a Pandas DataFrame.\n",
"        \"\"\"\n",
"        return pd.DataFrame(columns=self.config.feature_names, data=inputs)\n",
"\n",
"    def execute(self, inputs):\n",
"        \"\"\"\n",
"        Run the Pandas DataFrame through the loaded model\n",
"        and return the predicted class.\n",
"        \"\"\"\n",
"        return self.model.predict(inputs)\n",
"\n",
"    def transform_outputs(self, outputs):\n",
"        \"\"\"\n",
"        Use Packflow's `ensure_valid_output` to convert outputs to safe return types.\n",
"\n",
"        Note:\n",
"            This approach is less flexible and does not apply any business logic;\n",
"            for this demo we assume the output needs no special postprocessing.\n",
"        \"\"\"\n",
"        return ensure_valid_output(outputs, parent_key='class')\n",
"\n",
"\n",
"# Set defaults for base fields\n",
"backend = Backend(\n",
"    input_format='numpy'\n",
")" ] },
{ "cell_type": "markdown", "id": "bac49ef2-0920-471e-bb56-cd8b12cfac1a", "metadata": {}, "source": [ "We will also need to add the `inference` module to
the package's `__init__.py` file so it can be imported:" ] },
{ "cell_type": "code", "execution_count": 7, "id": "79eca476-4c84-4929-969a-b5e673ff901f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting sklearn_classifier/src/sklearn_classifier/__init__.py\n" ] } ], "source": [ "%%writefile sklearn_classifier/src/sklearn_classifier/__init__.py\n", "\n", "from . import inference" ] },
{ "cell_type": "markdown", "id": "33eab80d-d9cc-474a-94f1-3bf9e436ffda", "metadata": {}, "source": [ "Now that we've added an `inference.py` file to our Poetry package, we can use Packflow's ModuleLoader to import the backend and run it wherever needed." ] },
{ "cell_type": "markdown", "id": "c3b02ecd-97e6-4931-a45a-d3b21c8a892b", "metadata": {}, "source": [ "## Validating the Inference Backend\n", "\n", "Now that we've created the package, let's install it, then load the backend and validate that it runs as expected:" ] },
{ "cell_type": "code", "execution_count": 8, "id": "6cc879b6-6611-4e70-84d6-c6758b121502", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install ./sklearn_classifier --quiet" ] },
{ "cell_type": "markdown", "id": "176d960e-17a4-465f-a391-7a4c0984680b", "metadata": {}, "source": [ "## IMPORTANT\n", "You will likely need to restart the kernel in this notebook to proceed with loading and running the inference backend!" ] },
{ "cell_type": "code", "execution_count": 9, "id": "f17b2c78-d096-405b-ab8b-4f70879c0051", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32m2026-01-21 14:16:08.037\u001b[0m | \u001b[34m\u001b[1mDEBUG \u001b[0m | \u001b[36mpackflow.utils.normalize.base\u001b[0m:\u001b[36m_import_module\u001b[0m:\u001b[36m30\u001b[0m - \u001b[34m\u001b[1mTorchScalarHandler Type Converter is not available.
Reason: No module named 'torch'\u001b[0m\n", "\u001b[32m2026-01-21 14:16:08.038\u001b[0m | \u001b[34m\u001b[1mDEBUG \u001b[0m | \u001b[36mpackflow.utils.normalize.base\u001b[0m:\u001b[36m_import_module\u001b[0m:\u001b[36m30\u001b[0m - \u001b[34m\u001b[1mTorchTensorHandler Type Converter is not available. Reason: No module named 'torch'\u001b[0m\n", "\u001b[32m2026-01-21 14:16:08.039\u001b[0m | \u001b[34m\u001b[1mDEBUG \u001b[0m | \u001b[36mpackflow.utils.normalize.base\u001b[0m:\u001b[36m_import_module\u001b[0m:\u001b[36m30\u001b[0m - \u001b[34m\u001b[1mPillowImageHandler Type Converter is not available. Reason: No module named 'PIL'\u001b[0m\n", "\u001b[32m2026-01-21 14:16:08.059\u001b[0m | \u001b[34m\u001b[1mDEBUG \u001b[0m | \u001b[36mpackflow.backend.configuration\u001b[0m:\u001b[36mload_backend_configuration\u001b[0m:\u001b[36m63\u001b[0m - \u001b[34m\u001b[1mLoaded raw configuration: {'input_format': 'numpy'}\u001b[0m\n", "\u001b[32m2026-01-21 14:16:08.060\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpackflow.backend.configuration\u001b[0m:\u001b[36mload_backend_configuration\u001b[0m:\u001b[36m67\u001b[0m - \u001b[1mConfiguration: SklearnClassifierConfig(verbose=True, input_format=, rename_fields={}, feature_names=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], flatten_nested_inputs=False, flatten_lists=False, nested_field_delimiter='.', serialized_model_path=PosixPath('/Users/cdao-user/.pyenv/versions/3.11.14/envs/packflow/lib/python3.11/site-packages/sklearn_classifier/model.joblib'))\u001b[0m\n", "\u001b[32m2026-01-21 14:16:08.060\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36msklearn_classifier.inference\u001b[0m:\u001b[36minitialize\u001b[0m:\u001b[36m31\u001b[0m - \u001b[1mLoading model from: /Users/cdao-user/.pyenv/versions/3.11.14/envs/packflow/lib/python3.11/site-packages/sklearn_classifier/model.joblib\u001b[0m\n", "\u001b[32m2026-01-21 14:16:08.061\u001b[0m | \u001b[1mINFO \u001b[0m | 
\u001b[36mpackflow.backend.base\u001b[0m:\u001b[36m_initialize\u001b[0m:\u001b[36m103\u001b[0m - \u001b[1mInitialized Backend in 0.0009 ms\u001b[0m\n" ] },
{ "data": { "text/plain": [ "Backend[\n", " SklearnClassifierConfig(verbose=True, input_format=, rename_fields={}, feature_names=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], flatten_nested_inputs=False, flatten_lists=False, nested_field_delimiter='.', serialized_model_path=PosixPath('/Users/cdao-user/.pyenv/versions/3.11.14/envs/packflow/lib/python3.11/site-packages/sklearn_classifier/model.joblib'))\n", "]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from packflow.loaders import ModuleLoader\n", "\n", "# Import the `backend` object from the `inference` module\n", "# of the installed Poetry package\n", "backend = ModuleLoader(\"sklearn_classifier.inference:backend\").load()\n", "\n", "backend" ] },
{ "cell_type": "code", "execution_count": 10, "id": "c42d3aa8-5dd0-430f-a276-334d4752fb47", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32m2026-01-21 14:16:08.071\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpackflow.backend.base\u001b[0m:\u001b[36m__call__\u001b[0m:\u001b[36m86\u001b[0m - \u001b[1mExecutionMetrics(batch_size=10, execution_times=ExecutionTimes(preprocess=0.01938, transform_inputs=0.09275, execute=0.60546, transform_outputs=0.02338), total_execution_time=0.74097)\u001b[0m\n" ] }, { "data": { "text/plain": [ "[{'class': 0}, {'class': 0}, {'class': 1}, {'class': 0}, {'class': 1}]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.datasets import load_iris\n", "\n", "X, _ = load_iris(return_X_y=True, as_frame=True)\n", "\n", "outputs = backend.validate(X.sample(10).to_dict(\"records\"))\n", "\n", "outputs[:5]" ] },
{ "cell_type": "markdown", "id":
"96198c63-0013-4fdc-9b82-bd395a296e97", "metadata": {}, "source": [ "## Conclusion\n", "\n", "In this example notebook, we went through a simple example of how to create an inference backend for a scikit-learn classifier.\n", "\n", "Try extending this example further to support custom output logic or supporting different model types!" ] } ], "metadata": { "kernelspec": { "display_name": "packflow", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 5 }