Machine Learning Pipelines with Python & scikit-learn

Building a model in a notebook is one thing. Shipping it reliably to production is another. In this post I'll walk through building a production-ready ML pipeline using scikit-learn: from raw data to a serialized model ready to serve predictions.

Why Pipelines?

Without a pipeline, your preprocessing steps live in scattered cells or helper functions. When you retrain or deploy, it's easy to:

Forget to apply the same scaler to test data
Leak validation data into your training transforms
Lose track of which feature encoding was used

A sklearn.pipeline.Pipeline chains transformers and an estimator into a single object. One .fit(), one .predict(), one .pkl file.
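
To make the "single object" idea concrete, here is a minimal sketch on synthetic data. The make_classification dataset, the LogisticRegression estimator, and the demo_ names are assumptions chosen for illustration, not part of the Titanic example. Because the scaler lives inside the pipeline, cross_val_score refits it on the training portion of every fold, so the held-out fold never influences the scaling statistics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: any numeric classification set would do).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# Each of the 5 folds fits the scaler and the classifier on its own training split only.
fold_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print(fold_scores.mean())

# A single fit and a single predict cover both steps.
demo_pipeline.fit(X_demo, y_demo)
demo_predictions = demo_pipeline.predict(X_demo[:3])
```

Scaling the full dataset before splitting would instead let every fold see statistics computed from its own validation rows, which is the leak the list above describes.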

The Dataset

We'll use the classic Titanic survival dataset as a running example:

import pandas as pd

df = pd.read_csv("titanic.csv")
print(df.head())

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Fare	Embarked
1	0	3	Braund...	male	22.0	1	0	7.25	S
2	1	1	Cumings...	female	38.0	1	0	71.28	C

Feature Engineering

Split features by type so we can apply different transformations:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

numeric_features = ['Age', 'Fare', 'SibSp', 'Parch']
categorical_features = ['Sex', 'Embarked', 'Pclass']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
])

Building the Full Pipeline

from sklearn.ensemble import GradientBoostingClassifier

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(
        n_estimators=200,
        learning_rate=0.05,
        max_depth=4,
        random_state=42,
    )),
])

Train / Validation Split

from sklearn.model_selection import train_test_split

X = df.drop(columns=['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'])
y = df['Survived']

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline.fit(X_train, y_train)

Evaluation

from sklearn.metrics import classification_report, roc_auc_score

y_pred = pipeline.predict(X_val)
y_proba = pipeline.predict_proba(X_val)[:, 1]

print(classification_report(y_val, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_val, y_proba):.4f}")

Example output:

              precision    recall  f1-score   support
           0       0.84      0.89      0.87       110
           1       0.81      0.74      0.77        69

    accuracy                           0.83       179
ROC-AUC: 0.8901

Hyperparameter Tuning with GridSearchCV

Prefix each parameter name with its pipeline step name and a double underscore (for example, classifier__max_depth):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__learning_rate': [0.01, 0.05, 0.1],
    'classifier__max_depth': [3, 4, 5],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1,
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Best CV AUC:", search.best_score_)
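
If you are unsure which prefixed names are valid, get_params() lists them. The sketch below uses a simplified stand-in pipeline (an assumption: a plain StandardScaler takes the place of the ColumnTransformer so the block stays short); the naming rule is the same for the full pipeline.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simplified stand-in for the article's pipeline (assumption, see lead-in).
sketch_pipeline = Pipeline(steps=[
    ('preprocessor', StandardScaler()),
    ('classifier', GradientBoostingClassifier(random_state=42)),
])

# Every tunable parameter appears as '<step name>__<parameter name>'.
classifier_params = sorted(
    name for name in sketch_pipeline.get_params() if name.startswith('classifier__')
)
print(classifier_params)

# set_params accepts the same prefixed names that GridSearchCV uses.
sketch_pipeline.set_params(classifier__max_depth=3)
```

A typo in a prefixed name makes GridSearchCV raise an error about an invalid parameter, so listing the names first can save a failed search.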

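💡 Tip: The grid above has 27 combinations, so 5-fold CV runs 135 fits. For much larger grids, use RandomizedSearchCV with a fixed budget such as n_iter=50: it samples parameter combinations at random and often finds a comparable configuration in far fewer fits.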
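
A randomized search along those lines could be sketched as follows. The synthetic data, the simplified pipeline, and the sampling ranges are assumptions for illustration, and n_iter is kept at 5 here only so the block runs in a few seconds.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: replaces X_train / y_train).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

sketch_pipeline = Pipeline(steps=[
    ('preprocessor', StandardScaler()),
    ('classifier', GradientBoostingClassifier(random_state=42)),
])

# Distributions instead of fixed lists; the ranges are illustrative.
param_distributions = {
    'classifier__n_estimators': randint(20, 61),       # integers 20..60
    'classifier__learning_rate': uniform(0.01, 0.19),  # floats in [0.01, 0.20]
    'classifier__max_depth': randint(2, 5),            # integers 2..4
}

random_search = RandomizedSearchCV(
    sketch_pipeline,
    param_distributions,
    n_iter=5,          # small budget for the sketch; raise it for a real search
    cv=3,
    scoring='roc_auc',
    random_state=42,   # makes the sampled combinations reproducible
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
```

The cost is set by n_iter times the number of folds, regardless of how wide the distributions are.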

Serialising the Model

import joblib

joblib.dump(search.best_estimator_, "model.pkl")

# Later, in your API:
model = joblib.load("model.pkl")
# new_data_df: a DataFrame with the same feature columns the pipeline was trained on
prediction = model.predict(new_data_df)

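⚠️ Warning: joblib pickles are version-sensitive: a model saved under one scikit-learn version may fail to load, or behave differently, under another. Pin your scikit-learn version in requirements.txt and record the Python version used to serialize. Only load pickles from sources you trust, because unpickling can execute arbitrary code.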
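
One way to act on this warning is to store the versions next to the model and check them at load time. This is a minimal sketch, not an established convention: the model_metadata.json file name and the load_checked helper are hypothetical, and a DummyClassifier stands in for search.best_estimator_.

```python
import json
import platform
import tempfile
from pathlib import Path

import joblib
import sklearn
from sklearn.dummy import DummyClassifier

# Stand-in for search.best_estimator_ (assumption: any fitted estimator works the same way).
fitted_model = DummyClassifier(strategy='most_frequent').fit([[0], [1], [2]], [0, 1, 1])

output_dir = Path(tempfile.mkdtemp())
model_path = output_dir / 'model.pkl'
metadata_path = output_dir / 'model_metadata.json'  # hypothetical sidecar file

# Save the model and record the environment it was serialized in.
joblib.dump(fitted_model, model_path)
metadata_path.write_text(json.dumps({
    'sklearn_version': sklearn.__version__,
    'python_version': platform.python_version(),
}))


def load_checked(model_file, metadata_file):
    """Load the model only if the installed scikit-learn matches the recorded version."""
    recorded = json.loads(Path(metadata_file).read_text())
    if recorded['sklearn_version'] != sklearn.__version__:
        raise RuntimeError(
            f"Model was saved with scikit-learn {recorded['sklearn_version']}, "
            f"but {sklearn.__version__} is installed"
        )
    return joblib.load(model_file)


reloaded_model = load_checked(model_path, metadata_path)
```

Failing at startup with a clear message is usually easier to debug than an unpickling error or silently different predictions later.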

Serving with FastAPI

from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import pandas as pd

app = FastAPI()
model = joblib.load("model.pkl")

class Passenger(BaseModel):
    Age: float
    Fare: float
    SibSp: int
    Parch: int
    Sex: str
    Embarked: str
    Pclass: int

@app.post("/predict")
def predict(passenger: Passenger):
    # .dict() is the Pydantic v1 API; on Pydantic v2 it is deprecated in favor of .model_dump()
    df = pd.DataFrame([passenger.dict()])
    prob = model.predict_proba(df)[0][1]
    return {"survived_probability": round(float(prob), 4)}
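
Inside the endpoint, Pydantic already rejects requests with missing fields. When the same model is called outside FastAPI, for example from a batch script, a similar check has to be done by hand. The sketch below shows one way to do it; the payload_to_frame helper and the EXPECTED_COLUMNS constant are hypothetical names, not part of any library.

```python
import pandas as pd

# Feature names taken from the numeric_features and categorical_features lists.
EXPECTED_COLUMNS = ['Age', 'Fare', 'SibSp', 'Parch', 'Sex', 'Embarked', 'Pclass']


def payload_to_frame(payload: dict) -> pd.DataFrame:
    """Build a one-row DataFrame, failing early if a feature is missing."""
    missing = [name for name in EXPECTED_COLUMNS if name not in payload]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Selecting by name returns the columns in a fixed order and drops extras.
    return pd.DataFrame([payload])[EXPECTED_COLUMNS]


example_payload = {
    'Sex': 'male', 'Age': 22.0, 'Fare': 7.25, 'SibSp': 1,
    'Parch': 0, 'Embarked': 'S', 'Pclass': 3,
}
row = payload_to_frame(example_payload)
print(row)
```

The resulting row can be passed directly to model.predict_proba, just as the endpoint does.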

Key Takeaways

Always use a Pipeline: it keeps preprocessing inside each training fold, which prevents the most common form of data leakage, and it simplifies deployment
ColumnTransformer lets you apply different transforms to different columns cleanly
Serialize with joblib and pin your dependency versions
GridSearchCV with n_jobs=-1 parallelises tuning across all CPU cores
FastAPI offers a short path from a serialized model to a REST endpoint

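ℹ️ Note: For datasets too large to fit in memory, consider incremental learning with partial_fit, which scikit-learn provides on some estimators (for example SGDClassifier) but not on GradientBoostingClassifier. For large tabular datasets, gradient-boosting frameworks such as XGBoost, LightGBM, or CatBoost often train faster and scale better.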

Kavidu Hasaranga
