Bugs Fixed, Unit Testing In Place

The model takes all postfight stats in a dataframe scraped from ufcstats.com. The simplest model would just shift each fighter's rows back one, so the postfight stats from the previous fight become the prefight inputs for the upcoming one. However, I do some feature engineering that improves accuracy, such as dividing fighter1_age by fighter2_age and encoding the result as the age_differential stat; this lets me directly compare stats between two fighters before the fight. Unfortunately this adds complexity, because we end up with three categories of stats (a short code sketch follows the list):

  1. Stats that don’t change from prefight to postfight, like age, days since last competition, or reach.
  2. Stats that are known prefight but change after the fight, such as win streak.
  3. Stats that are only known postfight, like significant strikes landed.
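
To make the row-shifting and differential ideas concrete, here is a minimal sketch. The column names are assumptions for illustration; the real dataframe from ufcstats.com is larger and named differently.

```python
import pandas as pd

# Hypothetical per-fight rows for two fighters (assumed column names).
fights = pd.DataFrame({
    "fighter": ["A", "A", "A", "B", "B"],
    "fight_date": pd.to_datetime(
        ["2023-01-01", "2023-06-01", "2024-01-01", "2023-03-01", "2023-09-01"]),
    "sig_strikes_landed": [40, 55, 62, 30, 45],  # category 3: only known postfight
    "win_streak": [1, 2, 0, 1, 2],               # category 2: known prefight, changes postfight
})

# Sort each fighter's fights chronologically, then shift the postfight
# stats back one row so every row only contains what was known prefight.
fights = fights.sort_values(["fighter", "fight_date"])
for col in ["sig_strikes_landed", "win_streak"]:
    fights[col + "_prefight"] = fights.groupby("fighter")[col].shift(1)

# Differential features compare the two fighters in a matchup directly,
# e.g. age_differential = fighter1_age / fighter2_age, recalculated right
# before the fight (category 1 stats like age are recomputed, not shifted).
matchup = pd.DataFrame({"fighter1_age": [28.5], "fighter2_age": [31.5]})
matchup["age_differential"] = matchup["fighter1_age"] / matchup["fighter2_age"]
```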

Each of these categories has to be handled differently in the upcoming-fights prediction dataframe. The age_differential stat can be used to compare a fighter directly to his upcoming opponent without touching the previous fight's stats, since I recalculate age before computing the differential between the two fighters. The win_streak_differential stat can likewise be used to compare two fighters directly before their fight, but it changes after the fight, so I must recalculate win_streak_differential for each fight; stats like avg_win_streak, by contrast, are simply carried over from the previous fight. And therein lay the bug. In the training dataset, win_streak_differential was being calculated from postfight stats, the same way the prediction dataframe calculates it, which meant the training data could peek at the outcome of the fight. The bug was introduced in the previous update to the model, when I added the stat_peak and stat_valley features.
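
A small sketch of the leak, with hypothetical column names and a simple subtraction standing in for however the real differential is actually computed:

```python
import pandas as pd

# Hypothetical matchup-level training frame. *_pre columns hold the streak
# carried in from each fighter's previous fight; *_post columns hold the
# streak after the fight in question, which encodes its outcome.
train = pd.DataFrame({
    "f1_win_streak_pre":  [2, 0],
    "f2_win_streak_pre":  [1, 3],
    "f1_win_streak_post": [3, 0],
    "f2_win_streak_post": [0, 4],
})

# Buggy: building the differential from postfight streaks leaks the result
# of the very fight the model is supposed to predict.
train["win_streak_differential_leaky"] = (
    train["f1_win_streak_post"] - train["f2_win_streak_post"]
)

# Fixed: recalculate the differential from prefight streaks only.
train["win_streak_differential"] = (
    train["f1_win_streak_pre"] - train["f2_win_streak_pre"]
)
```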

I ended up finding and fixing this bug with unit tests. Unit tests are small checks that confirm a function does what you expect it to do. In this case, I checked that win_streak_differential in the training dataset was actually measuring the prefight win streaks of the two fighters. I implemented a lot of other checks, and this was the only bug found, so I feel fairly confident the rest of the data is being handled appropriately. I focused especially on the stats that track fighters' records, such as KO_losses and win_loss_ratio, since those have the biggest effect if they leak postfight data into the training dataset.
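
A pytest-style sketch of that check; add_win_streak_differential and the features module are stand-in names for the project's real feature-engineering code, and the column names follow the earlier hypothetical example.

```python
import pandas as pd

# Assumed import: a stand-in for the real feature-engineering function.
from features import add_win_streak_differential


def test_win_streak_differential_is_prefight():
    # Fighter 1 enters on a 2-fight streak, fighter 2 on a 1-fight streak.
    # Fighter 1 wins, so the postfight streaks become 3 and 0.
    frame = pd.DataFrame({
        "f1_win_streak_pre":  [2],
        "f2_win_streak_pre":  [1],
        "f1_win_streak_post": [3],
        "f2_win_streak_post": [0],
    })

    result = add_win_streak_differential(frame)

    # The differential must come from the prefight streaks (2 - 1 = 1),
    # not the postfight streaks (3 - 0 = 3), or training peeks at the outcome.
    assert result.loc[0, "win_streak_differential"] == 1
```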

Ultimately, fixing the leak brought accuracy down from 71%+ to 68%. That number makes sense given the many other UFC prediction models I’ve researched, and it still puts this one in the top two most accurate UFC prediction models that are publicly known.