Introduction
My UFC prediction model has maintained profitable performance for years despite having poorly calibrated probability estimates. While the model excels at binary classification (picking winners), its confidence scores don't align well with real-world win frequencies—a classic case of good accuracy, bad calibration.
This never bothered me much since the model was profitable, but I decided to experiment with post-hoc calibration methods to see if I could improve probability estimates without hurting classification performance. This post documents those experiments: the methods I tested, why most failed with limited data, and the modest improvements I eventually achieved with Platt scaling.
What is Model Calibration?
Model calibration refers to how well a model's predicted probabilities align with actual observed frequencies. A perfectly calibrated model should be correct 70% of the time when it predicts a 70% probability, 80% of the time when it predicts 80%, and so on.
Consider this example: if your model predicts 100 fights at 60% confidence, and the favored fighter wins 60 of those fights, your model is well-calibrated at that confidence level. However, if the favored fighter wins 75 times, your model is under-confident and poorly calibrated.
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Example: plotting a calibration (reliability) curve from true labels and predicted probabilities
fraction_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
plt.plot(mean_pred, fraction_pos, 's-', label='Model')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.legend()
plt.show()
The Calibration vs. Accuracy Paradox
One of the most counterintuitive aspects of calibration is that improving probability estimates can sometimes hurt classification accuracy. This happens because calibration optimizes for probability-based metrics like log-loss and Brier score, while accuracy only cares about the binary decision boundary at 50%.
Why This Happens
Consider a model that consistently predicts 65% when the true probability is 60%. Both numbers sit on the same side of 50%, so the overconfidence costs nothing in accuracy, yet the model is poorly calibrated. The trouble starts near the decision boundary: when calibration makes the predictions more conservative, estimates that were hovering just above 50% can get pulled below it, flipping picks and reducing accuracy even though the probabilities are more honest.
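To make this concrete, here's a toy sketch with entirely made-up numbers (not my model's outputs) showing how a calibrator that shrinks predictions toward 50% can improve log loss overall while flipping a few picks and costing accuracy:

import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Group A: 10 fights predicted at 0.80, but the favorite only wins 6 of 10 (overconfident)
# Group B: 10 fights predicted at 0.55, and the favorite wins 6 of 10 (roughly right)
y_true = np.array([1]*6 + [0]*4 + [1]*6 + [0]*4)
y_raw = np.array([0.80]*10 + [0.55]*10)

# Hypothetical calibrator that shrinks everything toward 0.5
y_cal = np.array([0.62]*10 + [0.49]*10)

for name, p in [("raw", y_raw), ("calibrated", y_cal)]:
    acc = accuracy_score(y_true, (p > 0.5).astype(int))
    print(f"{name}: accuracy={acc:.2f}, log loss={log_loss(y_true, p):.3f}")
# raw:        accuracy=0.60, log loss=0.728
# calibrated: accuracy=0.50, log loss=0.686  <- better probabilities, worse picks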
In my UFC model, this manifested as:
Original Model Performance:
Accuracy: 0.7098
Log Loss: 0.6032
Brier Score: 0.2075
Calibrated Model Performance:
Accuracy: 0.7098 (unchanged)
Log Loss: 0.5979 (improved by 0.0053)
Brier Score: 0.2056 (improved by 0.0019)
The accuracy remained stable while the probability-based metrics improved. That outcome isn't guaranteed, though: it doesn't hold for every calibration method, and it isn't even guaranteed every single time for the method that ultimately worked.
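For reference, these are the standard scikit-learn metrics behind those numbers; a minimal helper along these lines (report_metrics is my own naming, not part of the actual pipeline) is all it takes to reproduce the comparison:

from sklearn.metrics import accuracy_score, log_loss, brier_score_loss

def report_metrics(y_true, y_prob, label=""):
    """Print accuracy, log loss, and Brier score for a set of win probabilities."""
    acc = accuracy_score(y_true, (y_prob > 0.5).astype(int))
    print(f"{label} Accuracy: {acc:.4f}  "
          f"Log Loss: {log_loss(y_true, y_prob):.4f}  "
          f"Brier: {brier_score_loss(y_true, y_prob):.4f}")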
The Small Sample Size Challenge
My dataset contains approximately 2,400 UFC fights over 10 years after extensive filtering:
import pandas as pd

def filter_fights(df, threshold, date='2015-01-01', include_split_dec=False):
    """
    Filter fights based on:
    - Binary results (y_true in [0, 1])
    - Both fighters must have had at least `threshold` previous fights
    - Removing unwanted fight methods (DQ, split decisions, etc.)
    - Fights from 2015 onward
    """
    # Remove unwanted methods ('method' column name assumed in this excerpt)
    if include_split_dec:
        unwanted_methods = ['dq', 'other', 'overturned']
    else:
        unwanted_methods = ['dq', 'other', 'decision - split', 'decision - majority', 'overturned']
    df = df[~df['method'].isin(unwanted_methods)]

    # Filter to only binary results and recent fights
    # (the minimum-previous-fights filter on `threshold` is omitted from this excerpt)
    df = df[df['y_true'].isin([0, 1])].copy()
    df = df[df['event_date'] >= pd.Timestamp(date)]
    return df
This small sample size created significant challenges for calibration, particularly with isotonic regression which requires sufficient data points across the probability spectrum.
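One quick way to see the problem is to count how many calibration-set fights land in each probability bin; with roughly 240 fights spread over ten bins, the tails get very thin. A rough sketch (y_prob_cal below is a random stand-in, not my actual calibration-set probabilities):

import numpy as np

# Stand-in for the model's probabilities on a ~240-fight calibration split
y_prob_cal = np.random.default_rng(0).beta(5, 5, size=240)

# Count predictions per probability bin
counts, edges = np.histogram(y_prob_cal, bins=10, range=(0.0, 1.0))
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:.1f}-{hi:.1f}: {n} fights")
# Several bins end up with only a handful of examples, which is far too few
# for isotonic regression to estimate a reliable step function in those regions.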
Isotonic Regression: The First Attempt
Isotonic regression is a non-parametric calibration method that learns a monotonic mapping from predicted probabilities to calibrated probabilities. It's theoretically superior to Platt scaling as it can capture non-linear calibration relationships.
from sklearn.isotonic import IsotonicRegression

class SimpleIsotonicCalibration:
    def __init__(self, y_min=0.01, y_max=0.99):
        self.y_min = y_min
        self.y_max = y_max
        self.calibrator = None

    def fit(self, y_prob, y_true):
        # Learn a monotonic mapping from raw probabilities to observed outcomes
        self.calibrator = IsotonicRegression(
            y_min=self.y_min,
            y_max=self.y_max,
            out_of_bounds='clip'
        )
        self.calibrator.fit(y_prob, y_true)
        return self

    def predict(self, y_prob):
        # Map raw probabilities through the fitted isotonic function
        return self.calibrator.predict(y_prob)
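Usage follows the same holdout pattern as everything else in this post: fit on the calibration split, then score the untouched test split. A quick sketch, assuming y_prob_cal/y_cal and y_prob_test/y_test come from the calibration and test splits:

from sklearn.metrics import log_loss

iso = SimpleIsotonicCalibration()
iso.fit(y_prob_cal, y_cal)                  # calibration split only
y_prob_test_cal = iso.predict(y_prob_test)  # never refit on test data

print("Test log loss (original)  :", log_loss(y_test, y_prob_test))
print("Test log loss (calibrated):", log_loss(y_test, y_prob_test_cal))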
Isotonic Results (Disappointing)
Isotonic Calibration Results:
Should Use Calibration: False
Calibration Set Log Loss Improvement: 0.0553 # Improvement on the calibration set of data
Test Set Log Loss Improvement: 0.0125 # Worse on the unseen test set!
Test Set Brier Score Improvement: 0.0023 # Worse on the unseen test set!
Final Test Log Loss (Original): 0.5937
Final Test Log Loss (Calibrated): 0.6063
The isotonic calibration actually hurt performance on the test set despite improving calibration set metrics. This is a classic sign of overfitting due to insufficient data.

Why Isotonic Failed
- Small sample size: With only ~240 calibration samples (10% of 2,400), isotonic regression had insufficient data to learn a robust monotonic mapping
- Sparse probability regions: Some probability ranges had very few examples, leading to unreliable calibration
- Overfitting: The flexibility of isotonic regression became a liability with limited data
I attempted several improvements:
- Ensemble approach: Multiple isotonic regressors trained on different CV folds using all the training data. This was a mistake: I was fitting the calibration model on fights the main model had already been trained on, which led to overfitting on the training data and poor results on the holdout test set.
- Expanded calibration set: Increased from 10% to 20% of data
- Parameter tuning: Adjusted the y_min, y_max, and out_of_bounds settings
None of these approaches yielded meaningful improvements.
Platt Scaling: A Little Better
Platt scaling uses logistic regression to map uncalibrated probabilities to calibrated ones. While less flexible than isotonic regression, it's much more suitable for small datasets.
from sklearn.linear_model import LogisticRegression

class SimplePlattCalibration:
    def __init__(self, max_iter=100, random_state=42):
        self.max_iter = max_iter
        self.random_state = random_state
        self.calibrator = None

    def fit(self, y_prob, y_true):
        # Reshape probabilities for sklearn (needs 2D input)
        y_prob_reshaped = y_prob.reshape(-1, 1)
        self.calibrator = LogisticRegression(
            max_iter=self.max_iter,
            random_state=self.random_state,
            solver='lbfgs'
        )
        self.calibrator.fit(y_prob_reshaped, y_true)
        return self

    def predict(self, y_prob):
        # Calibrated probability of the positive class
        return self.calibrator.predict_proba(y_prob.reshape(-1, 1))[:, 1]
Implementation in Training Pipeline
The calibration was integrated into the training pipeline using scikit-learn's CalibratedClassifierCV:
from sklearn.calibration import CalibratedClassifierCV

# Three-way split for proper calibration validation
(X_train, y_train), (X_cal, y_cal), (X_test, y_test) = split_data_three_way(
    X, y, train_size=0.775, val_size=0.125
)

# Wrap AutoGluon predictor for sklearn compatibility
autogluon_wrapper = AutoGluonWrapper(predictor, feature_columns=X_train.columns.tolist())

# Create calibrated classifier using holdout method
calibrated_clf = CalibratedClassifierCV(
    estimator=autogluon_wrapper,
    method='sigmoid',  # Platt scaling
    cv="prefit",       # Use prefit since AutoGluon is already trained
    ensemble=False     # Use single calibrator since we have proper split
)

# Fit calibrator on holdout calibration set (sample_weight column dropped first)
X_cal_clean = X_cal.drop(columns=['sample_weight'], errors='ignore')
calibrated_clf.fit(X_cal_clean, y_cal)
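From there, calibrated probabilities come straight from predict_proba on the fitted wrapper. A sketch of the test-set comparison, assuming AutoGluonWrapper exposes a standard sklearn-style predict_proba and mirroring the sample_weight drop above:

from sklearn.metrics import log_loss

X_test_clean = X_test.drop(columns=['sample_weight'], errors='ignore')
y_prob_orig = autogluon_wrapper.predict_proba(X_test_clean)[:, 1]
y_prob_cal = calibrated_clf.predict_proba(X_test_clean)[:, 1]

print("Test log loss (original)  :", log_loss(y_test, y_prob_orig))
print("Test log loss (calibrated):", log_loss(y_test, y_prob_cal))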
Platt Scaling Results (Success!)
Calibration Results (sigmoid):
Should Use Calibration: True
Calibration Set Log Loss Improvement: 0.0098
Test Set Log Loss Improvement: 0.0107
Test Set Brier Score Improvement: 0.0043
Test Set ECE Improvement: 0.0174
Final Test Log Loss (Original): 0.5948
Final Test Log Loss (Calibrated): 0.5841
The improvement of 0.0107 in log loss on the holdout test set represents a meaningful, if small, gain in probability accuracy.
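The ECE figure above is expected calibration error: bin the predictions, then take the weighted average gap between the mean predicted probability and the observed win rate in each bin. A minimal version of how it can be computed (not necessarily the exact implementation behind the number above):

import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average |observed win rate - mean predicted probability| across bins."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob > lo) & (y_prob <= hi)
        if mask.sum() == 0:
            continue
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += (mask.sum() / len(y_prob)) * gap
    return ece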
Profitability Impact Analysis
While the statistical improvements were modest, I wanted to examine whether calibration affected betting profitability. Here's the cumulative profit comparison between the original and calibrated models on the test set:
[Figure: cumulative profit comparison, original vs. calibrated model, on the test set]
The results show modest but consistent improvements across most betting strategies. The core strategy (ai_all_picks_closing) improved from 13.26% to 13.60% ROI, while the seven-day advance strategy increased from 14.98% to 15.30% ROI. Notably, win percentages remained identical at 70.70%, which is interesting because some picks do change through the calibration process.
Interestingly, underdog-focused strategies saw slight decreases in ROI (35.31% to 34.97% at closing odds) alongside slightly improved favorite win percentages (70.70% to 70.80%), suggesting the calibration process may have made the model slightly more conservative on high-value underdog picks. Meanwhile, the poorly performing +EV-on-any-fighter strategies (betting value regardless of the AI pick) remained unprofitable in both versions but improved dramatically in the calibrated version, which is the calibration visibly doing its job. The AI-picked +EV strategy was basically unchanged, though, which tells me I need more profit testing on where the +EV threshold should sit: +5% win chance? +10%? When do we bet against the AI pick? Last event, for example, the AI picked Kevin Holland at around -180 while Vegas had him at roughly -500. What's the threshold to pick against the AI? I don't know yet, but I'll go test it out. This is the most pressing open question for the betting strategy.
However, it's crucial to note that ROI is a moving target—betting markets evolve, line movement varies, and small sample sizes can significantly impact results. These profit tests represent performance on a specific test set and shouldn't be viewed as guaranteed future returns. The real value of calibration lies in more reliable probability estimates for bet sizing and strategy decisions rather than raw profit maximization.
Model Performance Analysis
Here's how the calibrated model performed against Vegas odds across all unfiltered fights from the past 1.5 years (meaning split decisions, DQs, and other outcomes excluded from model training are included here):
{
"vegas_odds_performance": {
"accuracy": 0.700,
"log_loss": 0.563,
"brier_score": 0.194
},
"mma_ai_performance": {
"accuracy": 0.710,
"log_loss": 0.603,
"brier_score": 0.208
},
"mma_ai_performance_calibrated": {
"accuracy": 0.710,
"log_loss": 0.598, # Improved
"brier_score": 0.206 # Improved
}
}
The calibrated model maintains identical accuracy while providing more reliable probability estimates.
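Scoring "vegas_odds_performance" requires turning each fight's moneylines into a probability first. My understanding is the standard way to do this is a no-vig normalization along these lines (a sketch, not necessarily the exact method behind the numbers above):

def implied_prob(american_odds):
    """Implied win probability from a single American moneyline (still includes the vig)."""
    if american_odds > 0:
        return 100 / (american_odds + 100)
    return abs(american_odds) / (abs(american_odds) + 100)

def vegas_prob(f1_odds, f2_odds):
    """Fighter 1's no-vig probability from both fighters' moneylines."""
    p1, p2 = implied_prob(f1_odds), implied_prob(f2_odds)
    return p1 / (p1 + p2)  # normalize so the two probabilities sum to 1

# e.g. fighter 1 at -200, fighter 2 at +170
print(vegas_prob(-200, +170))  # ~0.643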
The Calibration Curve Analysis
The most telling evidence comes from the calibration curve itself. My uncalibrated model exhibited classic miscalibration patterns:
- Underconfident in 50-60% range: When the model predicted 55%, fighters actually won ~60% of the time
- Overconfident in 40-50% range: When the model predicted 45%, fighters won closer to 40% of the time
def plot_calibration_curve(self, n_bins=10, include_all_fights: bool = False):
    # y_prob_model (raw model probabilities) and test_data_clean (test features with
    # the sample_weight column dropped) are prepared earlier in the full method

    # Get calibrated predictions if a calibrator is available
    y_prob_calibrated = None
    if self.calibrator is not None:
        y_prob_calibrated = self.calibrator.predict_proba(test_data_clean)[:, 1]

    # Calculate calibration curves
    prob_true_model, prob_pred_model = calibration_curve(self.y_test, y_prob_model, n_bins=n_bins)
    if y_prob_calibrated is not None:
        prob_true_calibrated, prob_pred_calibrated = calibration_curve(
            self.y_test, y_prob_calibrated, n_bins=n_bins
        )

    # Plot results
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect Calibration')
    plt.plot(prob_pred_model, prob_true_model, 's-', label='Model (Original)')
    if y_prob_calibrated is not None:
        plt.plot(prob_pred_calibrated, prob_true_calibrated, '^-',
                 label='Model Calibrated', alpha=0.8)
    plt.legend()
The Odds Inclusion Paradox
An interesting discovery was the relationship between including betting odds as features and calibration quality. In previous experiments 1-2 years ago, calibration consistently hurt performance; the key difference was that back then I was including Vegas odds as features.
With Odds (Historical)
- Pros: Exceptional calibration, higher accuracy (~73-74%)
- Cons: Model HEAVILY favored the odds over other features, struggled with underdog picks, and produced about 6% lower ROI when betting all model picks
Without Odds (Current)
- Pros: Better underdog detection, higher betting ROI, room for calibration improvement
- Cons: Lower accuracy (~71%), requires manual calibration
This represents a fascinating tradeoff: accuracy vs. profitability. The model without odds hits significantly more profitable underdog picks, even though its raw accuracy is lower.
Why Calibration Matters for Bettors
While calibration may not be necessary for a profitable model (mine was already profitable before calibration), it provides crucial benefits for bet sizing and strategy:
Kelly Criterion Application
With well-calibrated probabilities, bettors can use the Kelly Criterion more effectively:
def kelly_bet_size(prob, odds, bankroll):
    """Calculate optimal bet size using the Kelly Criterion."""
    decimal_odds = american_to_decimal(odds)  # American -> decimal odds (helper sketched below)
    edge = prob * decimal_odds - 1            # expected profit per unit staked
    if edge <= 0:
        return 0
    kelly_fraction = edge / (decimal_odds - 1)
    return min(kelly_fraction * bankroll, bankroll * 0.1)  # Cap at 10% of bankroll
I experimented with fractional Kelly in the past. It's promising long term, especially since the model's log loss is consistently less than 0.59, but the swings with full Kelly were too painful. If the model's log loss could be pushed below Vegas's by including the odds in the feature set, I think this becomes genuinely promising, but I'll experiment more with that later.
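The american_to_decimal helper above isn't shown anywhere in the post, but the conversion itself is standard, and a fractional-Kelly wrapper (quarter-Kelly here, purely illustrative) is the usual way to tame the swings. A sketch that builds on kelly_bet_size above:

def american_to_decimal(odds):
    """Convert American odds to decimal odds (e.g. -150 -> 1.67, +120 -> 2.20)."""
    return 1 + odds / 100 if odds > 0 else 1 + 100 / abs(odds)

def fractional_kelly_bet(prob, odds, bankroll, fraction=0.25):
    """Quarter-Kelly by default: same edge calculation, smaller stake."""
    return fraction * kelly_bet_size(prob, odds, bankroll)

# Example: model says 65%, book offers -150, $1,000 bankroll
# full Kelly suggests 12.5% of bankroll ($125), capped at 10% ($100), then quartered
print(fractional_kelly_bet(0.65, -150, 1000))  # -> 25.0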
Confidence-Based Strategies
Calibrated probabilities enable more sophisticated betting strategies:
def edge_based_betting_strategy(predictions, min_edge=0.05):
    """Select bets based on edge over Vegas odds rather than absolute confidence."""
    betting_opportunities = []
    for fight in predictions:
        # Get model probability and Vegas implied probability
        model_prob = fight['model_confidence']
        vegas_decimal_odds = american_to_decimal(fight['vegas_odds'])
        vegas_implied_prob = 1 / vegas_decimal_odds

        # Calculate edge (model probability - market probability)
        edge = model_prob - vegas_implied_prob

        # Only bet if we have a significant edge
        if edge >= min_edge:
            kelly_fraction = edge / (vegas_decimal_odds - 1)  # Kelly criterion stake fraction
            betting_opportunities.append({
                'pick': fight['fighter_name'],
                'edge': edge,
                'kelly_fraction': kelly_fraction,
            })
    return betting_opportunities
I'll probably implement something like this later on after I do more profitability backtesting with flat units.
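For what it's worth, the flat-unit backtest I keep referring to is simple to sketch: stake one unit per pick and sum the American-odds payouts. The function and field names below are illustrative, not the actual pipeline:

def flat_unit_roi(picks):
    """Stake 1 unit per pick; each pick has 'won' (bool) and 'odds' (American odds on our side)."""
    profit = 0.0
    for p in picks:
        if p['won']:
            profit += p['odds'] / 100 if p['odds'] > 0 else 100 / abs(p['odds'])
        else:
            profit -= 1.0
    return profit, profit / len(picks)  # total profit in units, ROI per unit staked

# e.g. wins at -150 and +120 plus one loss -> profit ~0.87 units, ROI ~29%
print(flat_unit_roi([{'won': True, 'odds': -150},
                     {'won': True, 'odds': 120},
                     {'won': False, 'odds': -200}]))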
Technical Implementation Details
The final calibration system integrates seamlessly with the existing prediction pipeline:
def _get_model_predictions(self, test_data, use_calibrated=None):
    """Get model predictions, optionally using calibrator."""
    if use_calibrated is None:
        use_calibrated = self.calibrator is not None

    # Get original predictions
    y_pred = self.predictor.predict(test_data)
    y_prob = self.predictor.predict_proba(test_data)

    # Apply calibration if available
    if use_calibrated and self.calibrator is not None:
        test_data_clean = test_data.drop(columns=['sample_weight'], errors='ignore')
        y_prob = self.calibrator.predict_proba(test_data_clean)[:, 1]
        y_pred = (y_prob > 0.5).astype(int)

    return y_pred, y_prob
Lessons Learned
- Dataset size matters: Isotonic regression requires substantial data; Platt scaling works better with limited samples
- Validation strategy is crucial: Proper train/calibration/test splits prevent overfitting
- Calibration ≠ Accuracy: Better probabilities don't always mean better classifications
- Feature engineering impacts calibration: Including odds improves calibration but hurts profitability
- Domain expertise guides tradeoffs: Understanding the betting market informed the decision to exclude odds
Conclusion
While the final improvement in logloss was modest (0.0107), it represents a meaningful step toward more reliable probability estimates.
The key insight is that calibration serves different purposes depending on your goals. For pure classification accuracy, it's unnecessary. And for models that are already well calibrated (like when the odds are included as features, or when the underlying learner already produces well-calibrated probabilities), applying Platt scaling or isotonic regression can actually make the output worse.
The surprising relationship between feature inclusion (odds) and calibration highlights the complex tradeoffs in machine learning systems. Sometimes the most accurate model isn't the most profitable one, and the most calibrated model isn't the most accurate one.
For practitioners working with limited datasets, Platt scaling offers a robust path to improved calibration. The simplicity of logistic regression makes it both interpretable and reliable, even when more sophisticated methods fail.