{ "cells": [ { "cell_type": "markdown", "id": "b784c0c7-7729-408a-a68f-a2b407ce03e1", "metadata": {}, "source": [ "# Scikit-Learn Classifier" ] }, { "cell_type": "markdown", "id": "b09ed980-8367-4f2e-a1ba-cc255a6797dd", "metadata": {}, "source": [ "This Notebook is designed to be an example for developing a modular, reusable Scikit-Learn classification backend. \n", "In this guide we will:\n", "\n", "1. Creating a project with the Poetry\n", "2. Train a classifier with Scikit-Learn\n", "3. Develop the Inference Backend for running the model with Packflow\n", "4. Load and validate the Backend from the installed package\n", "\n", "## Creating a Project\n", "\n", "First, We'll install poetry and create a new Project:" ] }, { "cell_type": "code", "execution_count": 1, "id": "1eb9b828-b480-4c50-86b6-6e434fd08cad", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install poetry --quiet" ] }, { "cell_type": "code", "execution_count": 2, "id": "2d09201e-ced1-4eee-a680-39cf483990f7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created package sklearn_classifier in sklearn_classifier\n" ] } ], "source": [ "%%sh\n", "\n", "poetry new sklearn_classifier" ] }, { "cell_type": "markdown", "id": "d2cf8853-9d9b-4d4a-a2cc-669b8360e0dc", "metadata": {}, "source": [ "Next, we need to install a few dependencies to our poetry project:" ] }, { "cell_type": "code", "execution_count": 3, "id": "87ae0685-890b-4f05-8ae9-181d88e2c791", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using version ^1.8.0 for scikit-learn\n", "Using version ^1.5.3 for joblib\n", "Using version ^3.0.0 for pandas\n", "\n", "Updating dependencies\n", "Resolving dependencies...\n", "\n", "No dependencies to install or update\n", "\n", "Writing lock file\n" ] } ], "source": [ "%%sh\n", "\n", "poetry --directory 
./sklearn_classifier add scikit-learn joblib pandas" ] }, { "cell_type": "markdown", "id": "007e6aa3-0141-45b0-92f4-6689bc7d05ab", "metadata": {}, "source": [ "## Training a Iris Classifier\n", "\n", "For our sample use-case, we'll use the Scikit-Learn Iris dataset and train a simple Decision Tree Classifier:" ] }, { "cell_type": "code", "execution_count": 4, "id": "bc223ca1-2fc7-4d29-8b5f-1dcba964e9bc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
215.13.71.50.4
294.73.21.60.2
1116.42.75.31.9
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "21 5.1 3.7 1.5 0.4\n", "29 4.7 3.2 1.6 0.2\n", "111 6.4 2.7 5.3 1.9" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.datasets import load_iris\n", "\n", "X, y = load_iris(return_X_y=True, as_frame=True)\n", "\n", "X.sample(3)" ] }, { "cell_type": "markdown", "id": "87289016-2f9f-439f-ab47-267aa9068c3e", "metadata": {}, "source": [ "For simplicity, we will ignore best practices and train our model on the entire dataset:" ] }, { "cell_type": "code", "execution_count": 5, "id": "e43682e3-a7f1-4f1c-9fab-7c3972b150e5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sklearn_classifier/src/sklearn_classifier/model.joblib']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "import joblib\n", "\n", "model = DecisionTreeClassifier()\n", "\n", "model.fit(X, y)\n", "\n", "joblib.dump(model, \"sklearn_classifier/src/sklearn_classifier/model.joblib\")" ] }, { "cell_type": "markdown", "id": "ba2cc9e3-7a1b-45b6-b76f-36ea116afa35", "metadata": {}, "source": [ "The model has now been trained (fit) and serialized with joblib to the path output above.\n", "\n", "## Developing the Inference Backend\n", "\n", "Now we can develop the Inference Backend for running and sharing the model with Packflow:" ] }, { "cell_type": "code", "execution_count": 6, "id": "2238de9b-e710-488e-8a84-14e9a749cbc9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing sklearn_classifier/src/sklearn_classifier/inference.py\n" ] } ], "source": [ "%%writefile sklearn_classifier/src/sklearn_classifier/inference.py\n", "\n", "# -- Packflow imports --\n", "from packflow import InferenceBackend, BackendConfig\n", "from packflow.utils.normalize import ensure_valid_output\n", "\n", "# -- Imports that are required to 
run the model --\n",
"from pathlib import Path\n",
"import pandas as pd\n",
"import sklearn\n",
"import joblib\n",
"\n",
"\n",
"class SklearnClassifierConfig(BackendConfig):\n",
"    # Create a config field for where to load the model from\n",
"    serialized_model_path: Path = Path(__file__).resolve().parent.joinpath('model.joblib')\n",
"\n",
"    # Define the default input feature names\n",
"    feature_names: list[str] = [\n",
"        'sepal length (cm)',\n",
"        'sepal width (cm)',\n",
"        'petal length (cm)',\n",
"        'petal width (cm)'\n",
"    ]\n",
"\n",
"\n",
"class Backend(InferenceBackend):\n",
"    # Override the default config model with the custom config defined above\n",
"    backend_config_model = SklearnClassifierConfig\n",
"\n",
"    def initialize(self):\n",
"        self.logger.info(f'Loading model from: {self.config.serialized_model_path}')\n",
"        self.model = joblib.load(self.config.serialized_model_path)\n",
"\n",
"    def transform_inputs(self, inputs):\n",
"        \"\"\"\n",
"        Convert the input array (this backend uses the Numpy Preprocessor)\n",
"        to a Pandas DataFrame.\n",
"        \"\"\"\n",
"        return pd.DataFrame(columns=self.config.feature_names, data=inputs)\n",
"\n",
"    def execute(self, inputs):\n",
"        \"\"\"\n",
"        Run the Pandas DataFrame through the loaded model\n",
"        and return the predicted class.\n",
"        \"\"\"\n",
"        return self.model.predict(inputs)\n",
"\n",
"    def transform_outputs(self, outputs):\n",
"        \"\"\"\n",
"        Use Packflow's `ensure_valid_output` to convert outputs to safe return types.\n",
"\n",
"        Note:\n",
"            This approach is less flexible and does not apply any business logic;\n",
"            for this demo we assume the output needs no special postprocessing.\n",
"        \"\"\"\n",
"        return ensure_valid_output(outputs, parent_key='class')\n",
"\n",
"\n",
"# Set defaults for base fields\n",
"backend = Backend(\n",
"    input_format='numpy'\n",
")" ] },
{ "cell_type": "markdown", "id": "bac49ef2-0920-471e-bb56-cd8b12cfac1a", "metadata": {}, "source": [ "We will also need to add the `inference` module to
the package's `__init__.py` file so it can be imported:" ] },
{ "cell_type": "code", "execution_count": 7, "id": "79eca476-4c84-4929-969a-b5e673ff901f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting sklearn_classifier/src/sklearn_classifier/__init__.py\n" ] } ], "source": [ "%%writefile sklearn_classifier/src/sklearn_classifier/__init__.py\n", "\n", "from . import inference" ] },
{ "cell_type": "markdown", "id": "33eab80d-d9cc-474a-94f1-3bf9e436ffda", "metadata": {}, "source": [ "Now that we've added an `inference.py` file to our Poetry package, we can use Packflow's ModuleLoader to import the backend and run it wherever needed." ] },
{ "cell_type": "markdown", "id": "c3b02ecd-97e6-4931-a45a-d3b21c8a892b", "metadata": {}, "source": [ "## Validating the Inference Backend\n", "\n", "Now that we've created the package, let's install it, then load the backend and validate that it runs as expected:" ] },
{ "cell_type": "code", "execution_count": 8, "id": "6cc879b6-6611-4e70-84d6-c6758b121502", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install ./sklearn_classifier --quiet" ] },
{ "cell_type": "markdown", "id": "176d960e-17a4-465f-a391-7a4c0984680b", "metadata": {}, "source": [ "## IMPORTANT\n", "You will likely need to restart the kernel in this notebook to proceed with loading and running the inference backend!" ] },
{ "cell_type": "code", "execution_count": 9, "id": "f17b2c78-d096-405b-ab8b-4f70879c0051", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32m2026-01-21 14:16:08.037\u001b[0m | \u001b[34m\u001b[1mDEBUG \u001b[0m | \u001b[36mpackflow.utils.normalize.base\u001b[0m:\u001b[36m_import_module\u001b[0m:\u001b[36m30\u001b[0m - \u001b[34m\u001b[1mTorchScalarHandler Type Converter is not available.
Reason: No module named 'torch'\u001b[0m\n", "\u001b[32m2026-01-21 14:16:08.038\u001b[0m | \u001b[34m\u001b[1mDEBUG \u001b[0m | \u001b[36mpackflow.utils.normalize.base\u001b[0m:\u001b[36m_import_module\u001b[0m:\u001b[36m30\u001b[0m - \u001b[34m\u001b[1mTorchTensorHandler Type Converter is not available. Reason: No module named 'torch'\u001b[0m\n", "\u001b[32m2026-01-21 14:16:08.039\u001b[0m | \u001b[34m\u001b[1mDEBUG \u001b[0m | \u001b[36mpackflow.utils.normalize.base\u001b[0m:\u001b[36m_import_module\u001b[0m:\u001b[36m30\u001b[0m - \u001b[34m\u001b[1mPillowImageHandler Type Converter is not available. Reason: No module named 'PIL'\u001b[0m\n", "\u001b[32m2026-01-21 14:16:08.059\u001b[0m | \u001b[34m\u001b[1mDEBUG \u001b[0m | \u001b[36mpackflow.backend.configuration\u001b[0m:\u001b[36mload_backend_configuration\u001b[0m:\u001b[36m63\u001b[0m - \u001b[34m\u001b[1mLoaded raw configuration: {'input_format': 'numpy'}\u001b[0m\n", "\u001b[32m2026-01-21 14:16:08.060\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpackflow.backend.configuration\u001b[0m:\u001b[36mload_backend_configuration\u001b[0m:\u001b[36m67\u001b[0m - \u001b[1mConfiguration: SklearnClassifierConfig(verbose=True, input_format=, rename_fields={}, feature_names=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], flatten_nested_inputs=False, flatten_lists=False, nested_field_delimiter='.', serialized_model_path=PosixPath('/Users/cdao-user/.pyenv/versions/3.11.14/envs/packflow/lib/python3.11/site-packages/sklearn_classifier/model.joblib'))\u001b[0m\n", "\u001b[32m2026-01-21 14:16:08.060\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36msklearn_classifier.inference\u001b[0m:\u001b[36minitialize\u001b[0m:\u001b[36m31\u001b[0m - \u001b[1mLoading model from: /Users/cdao-user/.pyenv/versions/3.11.14/envs/packflow/lib/python3.11/site-packages/sklearn_classifier/model.joblib\u001b[0m\n", "\u001b[32m2026-01-21 14:16:08.061\u001b[0m | \u001b[1mINFO \u001b[0m | 
\u001b[36mpackflow.backend.base\u001b[0m:\u001b[36m_initialize\u001b[0m:\u001b[36m103\u001b[0m - \u001b[1mInitialized Backend in 0.0009 ms\u001b[0m\n" ] },
{ "data": { "text/plain": [ "Backend[\n", " SklearnClassifierConfig(verbose=True, input_format=, rename_fields={}, feature_names=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], flatten_nested_inputs=False, flatten_lists=False, nested_field_delimiter='.', serialized_model_path=PosixPath('/Users/cdao-user/.pyenv/versions/3.11.14/envs/packflow/lib/python3.11/site-packages/sklearn_classifier/model.joblib'))\n", "]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from packflow.loaders import ModuleLoader\n", "\n", "# Import the `backend` object from the `inference` module\n", "# of the installed Poetry package\n", "backend = ModuleLoader(\"sklearn_classifier.inference:backend\").load()\n", "\n", "backend" ] },
{ "cell_type": "code", "execution_count": 10, "id": "c42d3aa8-5dd0-430f-a276-334d4752fb47", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32m2026-01-21 14:16:08.071\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mpackflow.backend.base\u001b[0m:\u001b[36m__call__\u001b[0m:\u001b[36m86\u001b[0m - \u001b[1mExecutionMetrics(batch_size=10, execution_times=ExecutionTimes(preprocess=0.01938, transform_inputs=0.09275, execute=0.60546, transform_outputs=0.02338), total_execution_time=0.74097)\u001b[0m\n" ] }, { "data": { "text/plain": [ "[{'class': 0}, {'class': 0}, {'class': 1}, {'class': 0}, {'class': 1}]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.datasets import load_iris\n", "\n", "X, _ = load_iris(return_X_y=True, as_frame=True)\n", "\n", "outputs = backend.validate(X.sample(10).to_dict(\"records\"))\n", "\n", "outputs[:5]" ] },
{ "cell_type": "markdown", "id":
"96198c63-0013-4fdc-9b82-bd395a296e97", "metadata": {}, "source": [ "## Conclusion\n", "\n", "In this example notebook, we went through a simple example of how to create an inference backend for a scikit-learn classifier.\n", "\n", "Try extending this example further to support custom output logic or supporting different model types!" ] } ], "metadata": { "kernelspec": { "display_name": "packflow", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 5 }