Machine Learning Pipelines with Python & scikit-learn
Building a model in a notebook is one thing. Shipping it reliably to production is another. In this post I'll walk through building a production-ready ML pipeline using scikit-learn: from raw data to a serialized model ready to serve predictions.
Why Pipelines?
Without a pipeline, your preprocessing steps live in scattered cells or helper functions. When you retrain or deploy, it's easy to:
- Forget to apply the same scaler to test data
- Leak validation data into your training transforms
- Lose track of which feature encoding was used
A sklearn.pipeline.Pipeline chains transformers and an estimator into a single object. One .fit(), one .predict(), one .pkl file.
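To make that concrete, here is a minimal sketch on synthetic data. The `make_classification` data, the logistic regression, and the `demo_` names are stand-ins I picked for illustration; they are not part of the Titanic example that follows. Because the scaler lives inside the pipeline, `cross_val_score` refits it on each training fold, so held-out rows never influence the scaling statistics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: any tabular X, y would do here).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# The whole pipeline is cloned and refit per fold, scaler included.
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print(scores.mean())

# A single fit/predict pair covers preprocessing and the model.
demo_pipeline.fit(X_demo, y_demo)
preds = demo_pipeline.predict(X_demo[:3])
```

Scaling the full dataset before splitting would compute the mean and variance from rows that later serve as validation data, which is the leak described above.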
The Dataset
We'll use the classic Titanic survival dataset as a running example:
import pandas as pd
df = pd.read_csv("titanic.csv")
print(df.head())
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund... | male | 22.0 | 1 | 0 | 7.25 | S |
| 2 | 1 | 1 | Cumings... | female | 38.0 | 1 | 0 | 71.28 | C |
Feature Engineering
Split features by type so we can apply different transformations:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

numeric_features = ['Age', 'Fare', 'SibSp', 'Parch']
categorical_features = ['Sex', 'Embarked', 'Pclass']

# Numeric columns: fill missing values with the median, then standardize.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Categorical columns: fill missing values with the mode, then one-hot encode.
# sparse_output requires scikit-learn >= 1.2 (older releases call it sparse).
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

# Route each column group to its transformer; unlisted columns are dropped by default.
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
])
Building the Full Pipeline
from sklearn.ensemble import GradientBoostingClassifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(
        n_estimators=200,
        learning_rate=0.05,
        max_depth=4,
        random_state=42,
    )),
])
Train / Validation Split
from sklearn.model_selection import train_test_split
X = df.drop(columns=['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'])
y = df['Survived']
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
pipeline.fit(X_train, y_train)
Evaluation
from sklearn.metrics import classification_report, roc_auc_score
y_pred = pipeline.predict(X_val)
y_proba = pipeline.predict_proba(X_val)[:, 1]
print(classification_report(y_val, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_val, y_proba):.4f}")
Example output:
              precision    recall  f1-score   support

           0       0.84      0.89      0.87       110
           1       0.81      0.74      0.77        69

    accuracy                           0.83       179

ROC-AUC: 0.8901
Hyperparameter Tuning with GridSearchCV
Pass parameter names with the step prefix:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__learning_rate': [0.01, 0.05, 0.1],
    'classifier__max_depth': [3, 4, 5],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1,
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV AUC:", search.best_score_)
💡 Tip: For large grids, use RandomizedSearchCV with n_iter=50; it samples randomly and often finds good solutions much faster.
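A minimal sketch of that randomized search, on synthetic data so it runs on its own. The distributions and their ranges are illustrative choices of mine, not tuned recommendations, and `n_iter` is cut to 5 only so the example finishes quickly.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", GradientBoostingClassifier(random_state=42)),
])

# Distributions instead of fixed lists; the ranges here are illustrative.
param_distributions = {
    "classifier__n_estimators": randint(50, 150),       # integers in [50, 150)
    "classifier__learning_rate": uniform(0.01, 0.19),   # floats in [0.01, 0.20]
    "classifier__max_depth": randint(2, 5),             # integers in [2, 5)
}

random_search = RandomizedSearchCV(
    demo_pipeline,
    param_distributions,
    n_iter=5,           # kept small for the sketch; the tip suggests 50
    cv=3,
    scoring="roc_auc",
    random_state=42,    # makes the sampled combinations reproducible
)
random_search.fit(X_demo, y_demo)
print(random_search.best_params_)
```

The step-prefixed parameter names work exactly as they do with GridSearchCV; only the values change from lists to distributions.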
Serialising the Model
import joblib
joblib.dump(search.best_estimator_, "model.pkl")
# Later, in your API:
model = joblib.load("model.pkl")
# new_data_df: a DataFrame with the same feature columns used for training
prediction = model.predict(new_data_df)
⚠️ Warning: joblib pickles are version-sensitive. Pin your scikit-learn version in requirements.txt and document the Python version used to serialize.
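One possible way to document those versions is to store them next to the model in the same file. This is a sketch of a convention, not a scikit-learn feature: the `bundle` dictionary, its keys, and the file name are hypothetical, and a `DummyClassifier` stands in for the tuned pipeline.

```python
import os
import platform
import tempfile
import warnings

import joblib
import sklearn
from sklearn.dummy import DummyClassifier

# Placeholder model (assumption: in practice this would be search.best_estimator_).
model = DummyClassifier(strategy="most_frequent").fit([[0], [1], [2]], [0, 1, 1])

# Bundle the model with the versions used to produce it.
bundle = {
    "model": model,
    "sklearn_version": sklearn.__version__,
    "python_version": platform.python_version(),
}

path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")
joblib.dump(bundle, path)

# At load time, compare the recorded version with the running one.
loaded = joblib.load(path)
if loaded["sklearn_version"] != sklearn.__version__:
    warnings.warn("Model was serialized with a different scikit-learn version.")
restored_model = loaded["model"]
print(restored_model.predict([[5]]))
```

The check runs only after the file has been unpickled, so it is a record-keeping aid rather than a safeguard; pinning the dependency remains the real fix.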
Serving with FastAPI
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import pandas as pd

app = FastAPI()
model = joblib.load("model.pkl")  # loaded once, when the app starts

# Field names match the training columns so the pipeline can select them.
class Passenger(BaseModel):
    Age: float
    Fare: float
    SibSp: int
    Parch: int
    Sex: str
    Embarked: str
    Pclass: int

@app.post("/predict")
def predict(passenger: Passenger):
    # On Pydantic v2, passenger.model_dump() replaces the deprecated .dict().
    features = pd.DataFrame([passenger.dict()])
    prob = model.predict_proba(features)[0][1]
    return {"survived_probability": round(float(prob), 4)}
Key Takeaways
- Always use a Pipeline: it guards against data leakage and simplifies deployment
- ColumnTransformer lets you apply different transforms to different columns cleanly
- Serialize with joblib and pin your dependency versions
- GridSearchCV with n_jobs=-1 parallelises tuning across all CPU cores
- FastAPI is a quick path from model to REST endpoint
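ℹ️ Note: For very large datasets, consider incremental learning with partial_fit, which scikit-learn offers on some estimators such as SGDClassifier (GradientBoostingClassifier does not support it), or graduate to frameworks like XGBoost, LightGBM, or CatBoost for better performance on tabular data.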
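As a rough sketch of what incremental learning looks like, the snippet below feeds synthetic data to an `SGDClassifier` in chunks. The data, the chunk size, and the choice of estimator are assumptions for illustration; in a real setting each chunk would be read from disk or a database rather than sliced from an in-memory array.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for a dataset too large to fit in memory at once.
X_big, y_big = make_classification(n_samples=1000, n_features=10, random_state=0)
classes = np.unique(y_big)

incremental_model = SGDClassifier(random_state=0)

chunk_size = 200
for start in range(0, len(X_big), chunk_size):
    X_chunk = X_big[start:start + chunk_size]
    y_chunk = y_big[start:start + chunk_size]
    # classes is required on the first call; passing it every time is allowed.
    incremental_model.partial_fit(X_chunk, y_chunk, classes=classes)

accuracy = incremental_model.score(X_big, y_big)
print(accuracy)
```

Each `partial_fit` call updates the existing weights instead of starting over, which is what keeps memory use bounded by the chunk size.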