Predict Sports With AI

Why?

You can find scattered articles around Medium and machine learning sites written by other people attempting to predict sports with machine learning algorithms. The majority of these articles conclude with, “In conclusion, there’s some promise but it doesn’t look like it works that well”. This is patently false. Sports are excellent applications of machine learning algorithms, much better than, say, the stock market or housing prices, because accurately modeling those requires essentially the entire complexity of the world economy. Sports, however, is a market balanced only by Joe Beergut’s opinion of the skills of the competitors, and there are large troves of high-quality stats for sports. I spent the last several years attempting to predict UFC fights. The latest version is showing about 20% ROI over the past year of fights it’s never seen. Below you’ll find the general outline of the process I used to create this ensemble of models. The application of this process should be about the same for other 1v1 sports, and similar for team sports; the main change would just be the feature set chosen.

How?

You barely need to know how to code to get results similar to mine. ChatGPT will write the code for you! Example prompts:

Scraping:

You are the leading expert in machine learning engineering. You will receive $500 tip for high quality answers. How do I write a scrapy scraper to get all the data off ufcstats.com? Here is an example of the HTML that I want to scrape the data from [...]

Cleaning:

You are the leading expert in machine learning engineering. You will receive $500 tip for high quality answers. Here is an example of the data I have for UFC fights. 
[...]
I need it cleaned so it fits into a machine learning model. Give me example code for how to do this.

Ideas for algorithm selection:

You are the leading expert in machine learning engineering. You will receive $500 tip for high quality answers. Use chain of thought reasoning to provide a detailed answer. 

I have a tabular dataset in a dataframe of 7000 UFC fights, one fight per row with both players' stats included. My goal is a classification model to predict future winners of UFC fights.
Here is an example row: [...]
Here is the list of steps I've already taken [list out what you've done for hyperparameter optimization, feature selection, etc]

Which Python machine learning library or libraries would you select to optimize this goal?

I started this project before ChatGPT came out. Once it came out, I rewrote the entire thing using everything I learned plus my prompt engineering skills to make ChatGPT provide both direction and code for the most efficient way of progressing the project. If you’re ever stuck, just give ChatGPT all the steps you’ve already taken and ask it what direction of effort would maximally improve your logloss, accuracy, or whatever end goal you have in mind. This is the single most important thing you can do in terms of learning and efficiency. I cannot stress this enough. You don’t know what you don’t know. ChatGPT does.

Collecting The Data

This is actually one of the hardest parts. Web scraping is hairy business; sites change all the time, the organization of the code is always wonky, etc. Thankfully, I have written a lot of scrapers. Collection and cleaning of data is the meat and potatoes of machine learning; the code to actually run the ML model against the data is the easy part. In this case, we scrape all the data from ufcstats.com. Scrapy is a good choice for this as it’s asynchronous and easy to use. Scrape all the data and save it to CSV files. You’ll have two CSVs, one for the fighter stats and one for the fight stats; I name them players.csv and competitions.csv. Competitions.csv ends up looking like this:

The header columns are: result, player1, player2, player1_url, player2_url, weightclass, method, round, time, time_format, referee, details, player1_nickname, player2_nickname, event_date, event_location, event_url, followed by per-round stat columns named p<fighter>_rd<round>_<stat> for both fighters across rounds 1-5, where <stat> is one of KD, Sig_str, Total_str, Td, Sub_att, Rev, Ctrl, Head, Body, Leg, Distance, Clinch, or Ground.

An example row (per-round stat columns truncated; rounds 4 and 5 are nan for a three-round fight):

result: W | player1: Shara Magomedov | player2: Bruno Silva | player1_url: http://ufcstats.com/fighter-details/06734ca9d88dec3a | player2_url: http://ufcstats.com/fighter-details/12ebd7d157e91701 | weightclass: Middleweight Bout | method: Decision - Unanimous | round: 3 | time: 5:00 | time_format: 3 Rnd (5-5-5) | referee: Jason Herzog | details: Ben Cartlidge 27 - 30. Vito Paolillo 27 - 30. Darryl Ransom 27 - 30. | player1_nickname: Bullet | player2_nickname: Blindado | event_date: October 21, 2023 | event_location: Abu Dhabi, Abu Dhabi, United Arab Emirates | event_url: http://ufcstats.com/event-details/13a0fb8fbdafb54f | p1_rd1_KD: 0 | p1_rd1_Sig_str: 46 of 66 | p1_rd2_Sig_str: 43 of 58 | p1_rd3_Sig_str: 24 of 30 | p1_rd1_Total_str: 46 of 66 | p1_rd1_Td: 0 of 0 | p1_rd1_Ctrl: 0:00 | p2_rd1_Sig_str: 28 of 42 | p2_rd3_Td: 2 of 4 | p2_rd3_Ctrl: 4:19 | p1_rd1_Head: 16 of 31 | p2_rd2_Ground: 15 of 18 | ...

You can find my scraper here:
https://github.com/DanMcInerney/UFCScraper
And raw data here:
https://github.com/DanMcInerney/UFCScraper/blob/master/UFCScraper/competitions.csv
https://github.com/DanMcInerney/UFCScraper/blob/master/UFCScraper/individuals.csv

Cleaning The Data

This is the real core of data science: getting huge amounts of data into a uniform shape for feeding into your model. You can see that the raw CSV contains data like ’43 of 58′ for p1_rd2_Sig_str. That’s not useful for a model; you have to split that string and turn those numbers into integers or floats. I split them into columns like p1_rd2_Sig_str_landed and p1_rd2_Sig_str_attempted. It’s not really worth going into a lot of depth here because it’s a lot of manually changing strings into integers, but here’s an example of the code to clean the data:

def create_masterML(master):
    dm = DataManipulator()
    febr = FeatureEngineeringByRow()
    febi = FeatureEngineeringByIndividual()

    # Make conversions
    df = dm.convert_cols_to_inches(master)
    df = dm.convert_cols_str_replace_to_int(df)
    df = dm.convert_cols_to_seconds(df)
    df = dm.convert_dates(df)
    ...

class DataManipulator:

    def convert_to_inches(self, col, missing_str='--'):
        """ Converts strings like '5\' 11"' to int inches """
        new_col = col.apply(lambda x: int(x.split("'")[0])*12 + int(x.split("'")[1].replace('"', '')) if x != missing_str else np.nan)
        return new_col

Once it’s clean, combine players.csv and competitions.csv into one master dataframe using pandas, keyed on the fighters’ names and the date of the fight (a key-based pandas.merge() is the natural fit here). Ask ChatGPT if you’re unsure of how to combine them.
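
A minimal sketch of one way to do that join, assuming players.csv has one row per fighter with a 'name' column (adjust the keys to your actual columns, and add the fight date to the keys if your player table is time-indexed):

    import pandas as pd

    players = pd.read_csv('players.csv')            # one row per fighter
    competitions = pd.read_csv('competitions.csv')  # one row per fight

    # Attach each corner's fighter profile to the fight row; the prefixes avoid column collisions.
    master = competitions.merge(players.add_prefix('p1_'), left_on='player1', right_on='p1_name', how='left')
    master = master.merge(players.add_prefix('p2_'), left_on='player2', right_on='p2_name', how='left')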

Feature Engineering

Feature engineering is where you start using the base stats to create new stats. Example: divide Head_landed by Head_attempted to create a new stat like Head_accuracy. I do a number of these kinds of engineered features. Below is a list of most of the important ones (a rough sketch of a few follows the list):

  • average – career average
  • recent_avg – past 3 fights average
  • defense – opponent’s accuracy
  • accuracy – fighter’s landed / attempted
  • per second – stat per sec
  • ratio – fighter1_stat / fighter2_stat
  • peak – fighter’s peak stat
  • difference – fighter1_stat – fighter2_stat
  • control time per takedown
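
As a rough illustration of a few of these (not my exact code; the column names here are assumptions, and the same patterns apply to every other stat):

    import numpy as np

    def engineer_basic_features(df):
        # accuracy: landed / attempted (guard against divide-by-zero)
        df['p1_Head_accuracy'] = df['p1_Head_landed'] / df['p1_Head_attempted'].replace(0, np.nan)
        # per second: stat divided by total fight time in seconds
        df['p1_Sig_str_per_sec'] = df['p1_Sig_str_landed'] / df['total_time_seconds']
        # ratio and difference between the two fighters
        df['Td_landed_ratio'] = df['p1_Td_landed'] / df['p2_Td_landed'].replace(0, np.nan)
        df['Td_landed_difference'] = df['p1_Td_landed'] - df['p2_Td_landed']
        return df

    def career_aggregates(fighter_df):
        # fighter_df: one fighter's fights, sorted oldest to newest
        out = fighter_df.copy()
        out['Sig_str_landed_average'] = fighter_df['Sig_str_landed'].expanding().mean()                   # career average
        out['Sig_str_landed_recent_avg'] = fighter_df['Sig_str_landed'].rolling(3, min_periods=1).mean()  # past 3 fights
        out['Sig_str_landed_peak'] = fighter_df['Sig_str_landed'].cummax()                                # career peak
        return out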

Organization of the Data for Training

In my opinion, the best organization of the dataframe is one row per match. Then you create features that subtract fighter2’s stats from fighter1’s stats. Keep only these difference stats, and then build the model so it predicts only whether fighter1 will win. This halves the number of features the model has to look at with the same outcome.

Next, we limit the data to fights from 2014 onward. The reason for this is that the early UFC fights were under different rulesets and featured very different skillsets. We’re trying to predict future fights, and 2014 is a pretty good cutoff for the modern era of MMA. Additionally, the UFC greatly increased the number of fights per year starting in 2014.

Most important to creating the training dataframe is only using data from each fighter’s previous fights. Any post-fight data in the fight row is called leakage and will give you fake accuracy in the final model, because you’re feeding the model data it couldn’t have known until after the fight was over. I collect each fighter’s career dataframe of all their fights, then shift all data backwards one row: df_fighter[col].shift(1), and label this p1_<stat>_precomp. There are a few caveats to this. Age and days_since_last_fight are values that do not change post-fight, and it is useful to compare each fighter’s age and days_since_last_fight at the time of the fight. Don’t shift those columns.
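
Here’s a minimal sketch of that shift, assuming the fighter’s career dataframe is sorted oldest fight first (the p1_/p2_ prefixing happens when these rows get merged back into the fight rows):

    def add_precomp_stats(df_fighter, no_shift_cols=('age', 'days_since_last_fight')):
        """df_fighter: one fighter's career, sorted oldest fight first."""
        out = df_fighter.copy()
        for col in df_fighter.columns:
            if col in no_shift_cols:
                out[f'{col}_precomp'] = df_fighter[col]           # known at fight time, no shift
            else:
                out[f'{col}_precomp'] = df_fighter[col].shift(1)  # only pre-fight information
        return out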

Now we filter out fights we don’t want. In my case, I’m removing women’s fights, DQs, overturned results, wins by point deduction or illegal move, no contests, draws, and lastly, split decisions. The reasons for these are:

  • Women’s fights are harder to predict and follow different statistical patterns.
  • We don’t want the model trained on fights decided by point deductions or illegal moves because those endings don’t show up in the fight stats.
  • No contests and draws happen so rarely that it’s worth removing them so the model doesn’t try to predict that fights will go to a draw.
  • Split decisions are sketchy in the UFC, because the judges are bad or possibly because they’re influenced. Removing these from the dataset gave a 1% increase in prediction accuracy.
def filter_out_fights(self, df):
    df = self.remove_rows_by_col_string(df, 'weightclass', 'Women')
    df = self.remove_rows_by_col_string(df, 'method', 'DQ', exact=True, case_sensitive=True)
    df = self.remove_rows_by_col_string(df, 'method', 'Decision - Split')
    df = self.remove_rows_by_col_string(df, 'method', 'Overturned')
    df = self.remove_rows_by_col_string(df, 'method', 'Other', case_sensitive=True)
    df = self.remove_rows_by_col_string(df, 'details', 'Point Deducted')
    df = self.remove_rows_by_col_string(df, 'details', 'Points Deducted')
    df = self.remove_rows_by_col_string(df, 'details', 'Illegal')
    df = self.remove_rows_by_col_string(df, 'result', 'NC', exact=True, case_sensitive=True)
    df = self.remove_rows_by_col_string(df, 'result', 'D', exact=True, case_sensitive=True)
    return df

Balancing the dataset: in a binary classification problem on a tabular dataset, which is what we have, you need to balance the classes. If the red corner (fighter1) wins 60% of the time, which is true in the UFC, then the model will pick up on this pattern and lean towards fighter1. We don’t want that; we want an independent analysis. So here’s my code for balancing; it’s not efficient, but that’s mostly inconsequential. It swaps fighter1 and fighter2 in random rows until the win counts are about equal.

   def balance_wins(self, df):
        # Add a column to track whether a row's stats have been swapped
        df['swapped'] = False

        diff = df['player1_win'].sum() - df['player2_win'].sum()

        while diff not in [0, 1, 2]:
            if diff > 1:
                # Find a row to swap from player1_win to player2_win
                idx = df[(df['player1_win'] == 1) & (df['player2_win'] == 0)].sample(n=1).index[0]
            else:
                # Find a row to swap from player2_win to player1_win
                idx = df[(df['player1_win'] == 0) & (df['player2_win'] == 1)].sample(n=1).index[0]

            # Swap the player stats for the row
            self.swap_player_stats(df, idx)

            # Mark this row as swapped
            if df.at[idx, 'swapped'] == False:
                df.at[idx, 'swapped'] = True
            else:
                df.at[idx, 'swapped'] = False

            # Recalculate the difference
            diff = df['player1_win'].sum() - df['player2_win'].sum()

        return df

Finally, we drop any rows with NaN values (such as a fighter’s first fight in the UFC), remove all non-stat features like event_date, fighter names, etc., and keep only the _precomp stats. Rebalance one more time and we’re good.

Splitting the Data

Training a model requires a train/test split. Even better, a train/validate/test split. What happens is the model uses the training data to learn patterns. It then tests itself against the validation set and continues the process until it’s happy that it’s finding the right patterns. Then we take the trained model and use it to predict the test split of the data. If it performs well against the test split which it has never seen, then we can be sure it’s finding generalized patterns that will apply to the future. I use the standard practice of an 80/10/10 train/val/test split. The data is organized oldest fight to latest fight so the test split is more or less all the fights from 2023-2024 and val is mostly 2022-2023.

    def split_data_train_test_val(self, df):
        # Define indices for splits
        train_idx = int(0.80 * len(df))
        val_idx = train_idx + int(0.10 * len(df))

        # Split dataframes (.copy() so popping the target below doesn't warn about modifying a slice)
        train_df = df.iloc[:train_idx].copy()
        val_df = df.iloc[train_idx:val_idx].copy()
        test_df = df.iloc[val_idx:].copy()

        # Split target variable
        y_train = train_df.pop('player1_win')
        y_val = val_df.pop('player1_win')
        y_test = test_df.pop('player1_win')

        return train_df, val_df, test_df, y_train, y_val, y_test

Perfect, now we have exactly the data we need to start training the model.

Model Selection

If you just ask ChatGPT for the best model library for binary classification on a tabular dataset with several thousand rows, it’ll give you a selection of the best algorithms. In general these are going to be forest algorithms such as RandomForest or ExtraTreesClassifier and gradient boosting algorithms like XGBoost, LightGBM, or CatBoost. You can play around with all of them, but I have had the most success with CatBoost and XGBoost, CatBoost specifically. Once you’ve selected an algorithm, it’s time to tune its hyperparameters.

Hyperparameter Optimization

Hyperparameters are just the knobs that control how the algorithm learns. If you imagine trying to calculate the value of a business, you might use the formula: 3 * yearly_profit. That 3 could be considered a hyperparameter; maybe in the industry the business is in, a 2 or a 4 produces a more accurate valuation. In the same way, we’re going to try a lot of different hyperparameter values for the model’s algorithm to see which ones work best for generalized future predictions on our data.

We are going to use cross-validation, as is best practice, to select the model’s hyperparameters. We split the 80% training data into 4 equal folds. The model then tests different hyperparameters: it trains itself on the first fold, validating the patterns it found on the second fold, then trains on the second fold and validates its patterns against the third, and so forth. We take the average metric across all these folds to judge whether the hyperparameters were good. We’ll automate the hyperparameter search using the Optuna library, which smartly decides how to set the HPs. Here’s an example of the search space we’re using for CatBoost:

    def catboost_objective(self, trial):
        search_space = {
            'learning_rate': trial.suggest_float('learning_rate', .005, 0.03),
            'depth': trial.suggest_int("depth", 3, 8),
            'l2_leaf_reg': trial.suggest_float("l2_leaf_reg", 1, 15),
            'border_count': trial.suggest_int("border_count", 32, 255),
            'iterations': 1000,
            'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 30),
            'leaf_estimation_iterations': trial.suggest_int('leaf_estimation_iterations', 1, 10),
            'leaf_estimation_method': trial.suggest_categorical('leaf_estimation_method', ['Newton', 'Gradient']),
            'random_state': 42,
            'task_type': 'GPU'
        }

        avg_val_accuracy, avg_val_logloss, avg_train_accuracy, avg_train_logloss = self.models.train_catboost(search_space)
        return avg_val_logloss

In this case we’re just trying to minimize logloss, which is a measure of the model’s confidence in its predictions. The reason we aren’t maximizing accuracy is that accuracy is a red herring for sports betting. Accuracy =/= profit. If your model is very confident when predicting underdogs but is only 50% accurate, you’ll be more profitable than a 75%-accurate model that mostly predicts heavy betting favorites. I have validated this with many tests: models with accuracies in the 70%+ range had low profitability, while a logloss-optimized model at 66% accuracy was highly profitable.
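
For a concrete sense of what logloss rewards and punishes, here’s a toy example using sklearn’s log_loss (values are approximate):

    from sklearn.metrics import log_loss

    y_true = [1, 0, 1, 1]

    # Confident and right -> low logloss (~0.20)
    print(log_loss(y_true, [0.9, 0.1, 0.8, 0.7]))

    # One confident miss dominates the score (~0.92), even though the other three predictions are unchanged
    print(log_loss(y_true, [0.9, 0.95, 0.8, 0.7]))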

Optuna is cool because you can optimize for multiple objectives. You can optimize both accuracy and logloss, and even penalize overfitting by including the distance between the training and validation learning curves. Below is an objective for preventing overfitting while minimizing the average logloss. You can get pretty creative with these multi-objective searches.

obj = (avg_val_logloss - avg_train_logloss) + 4 * (avg_val_logloss)
return obj

I usually run Optuna for 50-150 trials, then use the best hyperparameters to train the final model on the train/val data and test the model against the test dataset. I’ll do that a bunch of times and select the best few models. In the mma-ai.net model ensemble I kept things simple and just minimized logloss, as I find the confidence score very valuable when comparing the model’s win% prediction to the win% implied by the real-world odds. It also opens up the ability to use the Kelly Criterion betting strategy.
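
Wiring the objective above into an Optuna study is only a few lines. A minimal sketch, assuming it runs inside the same class that defines catboost_objective:

    import optuna

    study = optuna.create_study(direction='minimize')       # minimizing average validation logloss
    study.optimize(self.catboost_objective, n_trials=100)   # somewhere in the 50-150 trial range

    print('best logloss:', study.best_value)
    best_params = study.best_params
    # Add back the fixed params the objective hard-codes before training the final model
    best_params.update({'iterations': 1000, 'random_state': 42, 'task_type': 'GPU'})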

Training the Model

I’ve picked up a few tricks to improve the model during training. One is to set sample weights on the training rows. We’re trying to predict future fights, so we give the latest fights a higher weight, meaning the model tries harder to get those right than the older fights. I’ve messed around with weights from 1.25 to 3, meaning patterns in the latest fights are given 125% to 300% of the value of older fights. 1.5 seems to work well. A question I had to experiment with was whether to add the weights before folding the data for HPO or to add the weights inter-fold. The first option means that if we fold 4 times with a weight of 1.5, the latest fight in fold 1 of 4 would only have a weight of 1.125. The second option means the latest fight in each fold gets the full 1.5. The second option makes more sense logically and performed a lot better in testing.

The second trick I found was using TimeSeriesSplit for the cross-validation folds. Commonly, people use StratifiedKFold, which means the rows of fights are randomized and each fold gets a class-balanced split where player1_win = 1 about as often as player1_win = 0. But we’re trying to predict future fights, so splitting the folds by the date the fights occurred makes more sense: the model trains itself on old fights, then tries to predict fights in the near future. This logic should hold true for all sports betting models.

So here’s how I train a CatBoost model for HPO (hyperparameter optimization):

    def train_catboost(self, params):
        tscv = TimeSeriesSplit(n_splits=self.folds)
        models = []
        fold = 0
        total_val_accuracy = []
        total_val_logloss = []
        total_train_accuracy = []
        total_train_logloss = []

        params.setdefault('eval_metric', 'Logloss')

        for train_index, test_index in tscv.split(self.X_train, self.y_train):
            fold += 1
            X_train_fold, X_val_fold = self.X_train.iloc[train_index], self.X_train.iloc[test_index]
            y_train_fold, y_val_fold = self.y_train.iloc[train_index], self.y_train.iloc[test_index]

            # Create inter-fold weights - increase weights for later dates within each fold
            train_fold_length = len(train_index)
            train_weights = np.linspace(1, self.weight, train_fold_length)

            # Validation weights can remain uniform or mimic the training weights
            val_fold_length = len(test_index)
            val_weights = np.ones(val_fold_length)  # Uniform weights for validation set

            train_pool = Pool(X_train_fold, y_train_fold, weight=train_weights)
            val_pool = Pool(X_val_fold, y_val_fold, weight=val_weights)

            model = CatBoostClassifier(**params)
            model.fit(train_pool, eval_set=val_pool, use_best_model=True, early_stopping_rounds=self.esr, verbose=False)
            models.append(model)

            # Evaluate on validation set
            val_accuracy, val_logloss = self.evaluate_model(
                model, X_val_fold, y_val_fold)
            total_val_accuracy.append(val_accuracy)
            total_val_logloss.append(val_logloss)

            # Evaluate on train set
            train_accuracy, train_logloss = self.evaluate_model(
                model, X_train_fold, y_train_fold)
            total_train_accuracy.append(train_accuracy)
            total_train_logloss.append(train_logloss)

        avg_val_accuracy = sum(total_val_accuracy) / len(total_val_accuracy)
        avg_val_logloss = sum(total_val_logloss) / len(total_val_logloss)
        avg_train_accuracy = sum(total_train_accuracy) / len(total_train_accuracy)
        avg_train_logloss = sum(total_train_logloss) / len(total_train_logloss)

        print(
            f"Accuracy: {avg_val_accuracy}, Logloss: {avg_val_logloss}, Train Accuracy: {avg_train_accuracy}, Train Logloss: {avg_train_logloss}")

        # Return the fold averages so the Optuna objective above can use them
        return avg_val_accuracy, avg_val_logloss, avg_train_accuracy, avg_train_logloss

Run the model through Optuna a bunch of times, then train the model on the original 80/10 train/val data and measure the full model’s metric (logloss in my case) against the unseen latest 10% of fights. Do that a bunch of times and select a few of the best models.
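
A sketch of that final fit, assuming train_df/val_df/test_df and the targets come from the split function above and best_params comes from the Optuna search:

    import pandas as pd
    from catboost import CatBoostClassifier, Pool
    from sklearn.metrics import accuracy_score, log_loss

    # Refit on the first 90% (train + val) and score on the untouched newest 10%
    X_trainval = pd.concat([train_df, val_df])
    y_trainval = pd.concat([y_train, y_val])

    final_model = CatBoostClassifier(**best_params)
    final_model.fit(Pool(X_trainval, y_trainval), verbose=False)

    test_probs = final_model.predict_proba(test_df)[:, 1]
    print('test logloss:', log_loss(y_test, test_probs))
    print('test accuracy:', accuracy_score(y_test, (test_probs > 0.5).astype(int)))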

Feature Selection

We’ve trained the model a bunch of times using all the features we created and picked the hyperparameters that perform best against the unseen test data. Now let’s narrow down the features. The reason we do this is that more features =/= a better model. If you have too many features, the model might pick up a fake pattern in one of them that just so happens to be great at predicting the test and val data but is bad at generalized prediction. More features make a model more likely to overfit to the data you possess. Additionally, the fewer features you have, the easier it is to understand why the model picked a certain outcome. Garbage data in, garbage data out. This is one of the hardest parts of the whole process; there isn’t a one-size-fits-all solution. In general, SHAP is an excellent library for finding how important each feature was to the model’s prediction. For example, here’s how to record the SHAP values:

    def SHAP_fs(self, model, X, thresh_perc=1):
        explainer = shap.Explainer(model)
        shap_values = explainer.shap_values(X)

        # For LightGBM binary classification, select the SHAP values for the positive class
        if isinstance(shap_values, list):
            shap_values = shap_values[1]  # assuming you're interested in the positive class

        shap_sum = np.abs(shap_values).sum(axis=0)
        sorted_indices = np.argsort(shap_sum)[::-1]
        cumulative_sum = np.cumsum(shap_sum[sorted_indices])
        total_shap_sum = cumulative_sum[-1]  # Total sum of SHAP values
        threshold = thresh_perc * total_shap_sum
        num_features = np.argmax(cumulative_sum >= threshold) + 1  # Number of features needed
        selected_features = X.columns[sorted_indices[:num_features]]
        cols = selected_features.tolist()

        # Sort the SHAP values and feature names
        sorted_indices = np.argsort(shap_sum)[::-1]
        sorted_shap_values = shap_sum[sorted_indices]
        sorted_feature_names = X.columns[sorted_indices]

        # Increase the figure size to make the y-axis longer
        plt.figure(figsize=(10, 8))
        # Plot the bar chart
        plt.bar(sorted_feature_names[:25], sorted_shap_values[:25])
        # Rotate the x-axis labels for better readability
        plt.xticks(rotation=90)
        plt.xlabel("Features")
        plt.ylabel("SHAP Value")
        plt.title("SHAP Feature Importance")
        # Adjust the placement of y-axis labels for better readability
        plt.subplots_adjust(bottom=0.4)  # You can adjust the value as needed
        plt.show()

        return cols

And you’ll get a nice graph showing the most important features.

Having done this, here’s how I ended up with my final feature set:

  • Run the model a bunch of times with the best hyperparameters and record each feature and SHAP value of the feature into a text file
  • Analyze the text file to produce a list of how many times each feature showed up in all the runs (a small tallying sketch follows this list)
  • Go through the list of the most commonly important features, look at their average SHAP, and decide if it’s a feature I want to keep.
  • Using my domain expertise in MMA, having been an MMA fighter myself for a decade-plus and having done MMA stats research for years, I exhaustively thought about each feature, what it was measuring, and decided whether it measured an overlapping area of the fight game with another feature.
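
The bookkeeping for the first two bullets can be as simple as a counter. A small sketch, assuming each run appends its features and SHAP sums as one JSON line to a hypothetical shap_runs.txt:

    import json
    from collections import Counter

    appearances = Counter()
    shap_totals = Counter()

    with open('shap_runs.txt') as f:                # hypothetical file: one {feature: shap_sum} dict per line
        for line in f:
            run = json.loads(line)
            for feature, value in run.items():
                appearances[feature] += 1
                shap_totals[feature] += value

    for feature, count in appearances.most_common(50):
        print(f'{feature}: appeared in {count} runs, avg SHAP {shap_totals[feature] / count:.4f}')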

I did not just choose individual features that had high SHAP values. I chose categories of features. For example, variations of Td_attempted showed up a LOT. So instead of just picking the 2 features that showed up the most, I added all the Td_attempted variations: average, recent_average, peak, rd1_Td_attempted/avg/rec_avg, etc. Even though some of those were lower in SHAP, the feature category appeared to be very important. Same thing with the less important feature categories. Body_landed didn’t show up much, but I felt that to most holistically represent all measurable aspects of the fight I should include at least one Body_landed feature.

Ultimately I ended up with about 81 features. This decreased my logloss and increased my accuracy by 1-2% over just picking the 80 highest SHAP features. Once this was done, I ran HPO again using just these features, trained a whole bunch of models, and kept the top 5 performing models on the unseen test data. Ensemble those models together like so:

    def ensemble_predict_proba(self, models, X):
        """
        Make predictions using the ensemble of XGBoost and CatBoost models.

        :returns: Averaged predictions from all models that player1 will [lose, win]. Output: [[0.1, 0.9], [0.2, 0.8], ...]
        """
        # Average predictions from all models
        predictions = None
        for model in models:
            model_preds = model.predict_proba(X)
            if predictions is None:
                predictions = model_preds
            else:
                predictions += model_preds

        return predictions / len(models)

Predicting Future Fights

Finally! We have a working model. Now comes another hard part, though: predicting future fights. We must create a new dataframe of just the upcoming fights and their updated stats. Briefly, here’s how it’s done (a rough sketch follows the list):

  • Load the whole dataset
  • Add new rows to the dataset representing the future fights. These rows should be the combination of each upcoming fighter’s most recent fight stats
  • Readjust the stats that don’t change post-fight like age and days_since_last_fight, and all their variations like peak, average, ratio, etc.
  • Redo the _difference stats to be the upcoming player1 – player2 stats
  • Drop all the historical fights and just leave the new upcoming fight rows
  • Feed this into the model ensemble with .predict_proba(upcoming_fights_df)
  • Pray you didn’t introduce a bug here (do unit testing) because a bug here screws up all the hard work you already did
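
A rough sketch of the row-building step, assuming you keep a per-fighter career dataframe around (the names and columns here are illustrative, not my exact pipeline):

    import pandas as pd

    def build_upcoming_rows(career_frames, matchups):
        """career_frames: {fighter_name: that fighter's career dataframe, oldest fight first}
        matchups: list of (fighter1, fighter2) tuples on the upcoming card."""
        rows = []
        for f1, f2 in matchups:
            f1_latest = career_frames[f1].iloc[-1]   # fighter1's most recent known stats
            f2_latest = career_frames[f2].iloc[-1]   # fighter2's most recent known stats
            row = {}
            for stat in f1_latest.index:
                row[f'p1_{stat}_precomp'] = f1_latest[stat]
                row[f'p2_{stat}_precomp'] = f2_latest[stat]
            # age, days_since_last_fight, and the _difference features still need to be
            # recomputed as of the upcoming event date before predicting
            rows.append(row)
        return pd.DataFrame(rows)

    # upcoming_df = build_upcoming_rows(career_frames, [('Fighter A', 'Fighter B')])
    # probs = ensemble_predict_proba(models, upcoming_df[feature_cols])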

Testing Profitability

To test the profitability of the model you’ll need accurate historical odds. We used to be able to scrape bestfightodds.com, but they implemented some very difficult-to-bypass anti-scraping measures, so I ended up just paying for TheOddsAPI.com, sadly. This is a giant pain in the ass, especially since the fighter names in TheOddsAPI don’t always match the names from ufcstats.com. I collect all the odds, put them into a dataframe, and write a little code to test the profit of the model ensemble.

    def ensemble_profit(self, odds, predicted_outcomes, actual_outcomes, filtered_df, bankroll, bet_amount=.01):
        current_event_url = None
        event_winnings = 0

        for index, (prediction, actual) in enumerate(zip(predicted_outcomes, actual_outcomes)):
            fight = odds.iloc[index]
            event_url = filtered_df.iloc[index]['event_url']

            # Check for event change
            if event_url != current_event_url and current_event_url is not None:
                # Update the bankroll at the end of an event
                bankroll += event_winnings
                event_winnings = 0  # Reset for the next event

            current_event_url = event_url

            if bet_amount < 1: # Percent of bankroll
                bet_size = np.round(bet_amount * bankroll, 2)
            else:
                bet_size = bet_amount

            # Check if the prediction matches the actual outcome
            prediction = 1 if prediction > .5 else 0
            if prediction == actual:
                if prediction == 1:  # if predicted player1 wins
                    winning_odds = fight['p1_odds']
                else:  # predicted player2 wins
                    winning_odds = fight['p2_odds']

                # Calculate winnings for correct predictions
                winnings = (winning_odds * bet_size) - bet_size
                event_winnings += winnings
            else:
                # Subtract the bet amount for incorrect predictions
                event_winnings -= bet_size

            # Debug
            self.print_debug(filtered_df, predicted_outcomes, actual_outcomes, odds, index, bankroll, bet_size)

        bankroll += event_winnings
        print(f'Final ensemble_profit bankroll: {bankroll}')

        return bankroll

When testing profitability, I only test over fights the model has never seen, the 2023-2024 test dataset, to avoid an overfit estimate. I did that for a few different strategies like the Kelly Criterion (tested best at 25% ROI over the year) and other variations like only betting on model-predicted underdogs with 1% of your bankroll per bet. You can see the results on the homepage of mma-ai.net. Looking good. I’d like to explore the strategy of making hundreds of small parlays with the model predictions to maximize profit and minimize loss later.
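
For reference, the Kelly Criterion bet size comes straight from the model’s win probability and the decimal odds. A minimal sketch (a fractional multiplier like 0.25 tempers the variance; this is illustrative, not my exact staking code):

    def kelly_fraction(win_prob, decimal_odds, fraction=0.25):
        """Fraction of bankroll to stake; returns 0 when the model sees no edge."""
        b = decimal_odds - 1                 # net payout per unit staked
        q = 1 - win_prob
        f = (b * win_prob - q) / b           # full Kelly
        return max(0.0, f * fraction)        # fractional Kelly, never bet a negative amount

    # e.g. model says 60% win probability at decimal odds of 2.10, $1000 bankroll
    stake = kelly_fraction(0.60, 2.10) * 1000   # roughly $59 at quarter Kelly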

Final Thoughts

The strategy used to create this ensemble should be replicable for other sports. I just like MMA so I did it for that. It would be neat to have a framework, like a Python library, that automatically does all these steps given any sport’s dataset. This took me 3 years though, starting from zero machine learning experience, so hell if I’m gonna be the one to do it. I’m super intrigued by LLMs for sports prediction. I will be creating a custom GPT soon for this purpose and checking its accuracy on future events. Custom GPTs from OpenAI are one possibility, and training a nanoGPT with like 50mil parameters from scratch is another.

Thank you to all who have helped me in these years getting this working. Big shoutout to Chris King from wolftickets.ai for all his help and support. I guarantee you his model is better than mine and recommend the very cheap $5/mo subscription for his service.

Future Work

I’m very intrigued by this project https://github.com/imsoo/fight_detection. It uses OpenPose to measure punches thrown. Jabbr.ai is doing something similar. I think video analysis rather than just raw post fight statistics is the key to jacking up predictive abilities to 80%+. The major stat I’d like to see automatically measured is reaction speed. I think reaction speed is one of the most fundamental attributes a fighter has and might be the single most predictive stat in MMA. Check out Georges St. Pierre talking about how he has someone manually analyze tape frame by frame to measure reaction time. BJ Penn had the fastest reaction time and Lyoto Machida was right next to him. https://www.youtube.com/watch?v=seIMSzBBxqI.

One idea I have for the future is to start looking for offbeat sports that bookies still set odds on. If we can find a good source of sport stats for something like, say, pickleball, I’m willing to bet there’s a goldmine in there for an ML algorithm. Another similar idea I might want to implement later is predicting political outcomes. Almost all the same techniques and stats would apply. Just instead of significant strikes, it’ll be their voting record, or their stance on abortion vs the polled stats of their constituency.

I think LLMs for these kinds of jobs are going to be much easier to create in the future. I’m currently working on UFC-prediction LLMs, although they currently suffer from some problems, like the difficulty of testing historical predictions.

Last, please contact me if you have suggestions for how to improve the model.