
Using Data Science to win a fitness competition.

It was March 2020.  I was working on a Master’s in Data Science at Berkeley, traveling 4 days a week for work, and with the precious time I had left, I was training for an event called Hyrox.

***UPDATE***

My race was delayed significantly due to COVID, but when it finally happened, I came in 5th in Men’s Pro, and went on to race in the World Championships. If you want to learn more… read on 😉

*** END UPDATE***

If you’re not familiar, Hyrox is sort of like an obstacle course race meets CrossFit.  But this description does not really capture just how torturous the event is.  The race is made up of eight 1 km runs, separated by an absurd volume of specific movements, such as 330 feet of lunges with a 66-pound bag on your back.

I was training when and where I could, but between work, travel, and school, my approach was haphazard and often limited by the equipment I could get my hands on each day. 

Then COVID hit.  Just weeks before the event, the race was pushed back to September, and as infection rates increased it was pushed back again to March 2021. Work and school slowed down as well. I knew that I now had time to train properly, if I could just figure out what I needed to work on.

I started rewriting my training program to focus on my weaknesses, but with 9 unique movements, it was hard to know where I stood.  So, I headed to the Hyrox website to see if they had any information on top times or splits.  To my surprise, I discovered that the Hyrox website had the full data on all 19 splits for every single athlete at every event.   

Immediately I knew I could use data science to help me figure out where my biggest gaps were, and how to make up the most time over the next few months as I trained.

The Data

The first step was gathering all of the data, and to be clear, I did not hack the Hyrox website.  Their permissions were set so I could scrape anything that was not an admin page. So I built a spider to crawl the pages, and when it was all said and done, I had 19 data points on 2,189 athletes that had raced in the “pro” division.
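For anyone who wants to run the same kind of permission check before crawling, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.com` URLs, and the `my-spider` user agent are all hypothetical placeholders, not the Hyrox site's actual rules or the spider described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, user_agent: str = "my-spider") -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(allowed(parser, "https://example.com/results/event-1"))  # True
print(allowed(parser, "https://example.com/admin/settings"))   # False
```

In a real crawler, each URL would go through a check like `allowed` before being requested.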

The Hypothesis

My theory was that an athlete’s overall time really came down to a handful of the 9 unique movements, and that the key to optimizing my time would lie in understanding the significance of each event.

I believed that the “variability” of an event would be an important indicator. For example, if the top athletes spent 3:30 on the SkiErg, and the middle-of-the-pack athletes spent 4:00, then I only stood to make up 0:30 going from average to great.  But if the top athletes spent 2:00 on the sled pull, and the middle-of-the-pack averaged 8:00, then I stood to make up 6:00 going from average to great there.  That spread represents the variability.  So I set out to investigate it.
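The arithmetic behind that spread can be sketched in a few lines of Python. The first function reproduces the two illustrative examples above. The second estimates the same gap from a list of split times, under the assumption (mine, not stated in the post) that "top" means the 10th percentile and "middle of the pack" means the median.

```python
import statistics


def to_seconds(mmss: str) -> int:
    """Convert a 'm:ss' split such as '3:30' into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def potential_gain(top: str, middle: str) -> int:
    """Seconds to be gained by moving from a middle time to a top time."""
    return to_seconds(middle) - to_seconds(top)


def spread(times_s: list) -> float:
    """Median minus the 10th percentile of a list of times in seconds.

    Treating the 10th percentile as 'top' is an assumption for this sketch.
    """
    top = statistics.quantiles(times_s, n=10)[0]
    return statistics.median(times_s) - top


print(potential_gain("3:30", "4:00"))  # 30  (the SkiErg example)
print(potential_gain("2:00", "8:00"))  # 360 (the sled pull example)
```

A larger `spread` for an event means more time is theoretically available between average and great.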

The Findings

Below is a dashboard showing the distribution of event times across all athletes at all events.

As I expected, there was significantly more variability in some events than in others.  Visually it was clear that movements like the row or Ski Erg were very concentrated, while movements like sled pull and wall balls had much broader curves, meaning much more variation.  To make this even more clear, we can look at the box plots for each event.

Here we see what the histograms suggested.  Events like Sled pull and Wallballs have huge variability with high average times, making these theoretically very valuable events to focus on.

To take this a step further, I fit a multivariate regression model to measure the “significance” of each event against the athlete’s overall rank.

Now we could easily go down a rabbit hole here discussing p-values, independence, and homoscedasticity. However, my ultimate goal with this post is not to reject a null hypothesis, but rather to explain how I built an optimized training model. So for now, all I want you to pay attention to is the number of stars in that right-hand column.  More stars = more likely to affect overall rank.
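For readers wondering where the stars come from: regression summaries commonly turn each coefficient's p-value into a star code. The sketch below uses the cutoffs from R's regression output (0.001, 0.01, 0.05, 0.1). The post does not say which tool produced the table, so treat these thresholds as an assumption, and the p-values in the usage lines as made-up inputs.

```python
def significance_stars(p_value: float) -> str:
    """Map a p-value to a star code using R-style cutoffs (assumed)."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must be between 0 and 1")
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    if p_value < 0.1:
        return "."
    return ""


# Made-up p-values, purely to show the mapping.
print(significance_stars(0.0004))  # ***
print(significance_stars(0.03))    # *
print(significance_stars(0.4))     # (empty string)
```

A smaller p-value earns more stars, meaning the data gives stronger evidence that the event's time is related to overall rank.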

Interestingly, we see that not all runs are as likely to impact overall performance, which makes sense when you look at the events preceding each run.  For example, how well you run after lunges (run 8) gives better insights into your conditioning compared to running after the SkiErg (run 2).     

So taking variability, distribution, and correlation all into account, these are the events that initially appear to be the most important.

| Exercise | Variability | Average Duration | Correlation to Rank |
| --- | --- | --- | --- |
| Run Total | Medium | Very High | Medium |
| Wallballs | High | High | Medium |
| Sled Pull | High | High | Medium |
| Sandbag Lunges | High | Medium | High |

While these events may seem to be the most important in an absolute sense, this list fails to account for the fact that getting better at any event runs into the problem of diminishing marginal returns.  Basically, this means every event has an upper bound of performance. The better you are, the less room you have to progress, and therefore the slower the progress is going to be.  This meant that the next step was to understand where I stood at that point, so I could gauge how I was likely to progress.

Building my fitness model

Before I could go any further, the next step was to test my current performance.  So I simulated the event, and it was terrible.  After 1:18:20 of work, I found myself lying in a ball on the ground, and I think I may have been drooling… but it was worth it, as I got what I needed.

 I had tracked every split and ended up with the best approximation I could get for my current times.

| Event | My Current Time |
| --- | --- |
| Run Total | 41:40 |
| Ski Erg | 03:44 |
| Sled Push | 02:22 |
| Sled Pull | 04:55 |
| Burpee | 04:37 |
| Row | 03:46 |
| Farmers Carry | 01:49 |
| Lunges | 04:18 |
| Wall Balls | 06:23 |
| Transition | 04:47 |
| Total | 1:18:20 |

Next, I thought about what I could reasonably handle for training volume without getting hurt or burnt out.  I knew that I couldn’t make everything a priority, so I settled on 3 buckets:

5 Maintenance Exercises:
Maintain current performance but make slight improvements as I improve general fitness.

3 Priority Exercises:
Work on these exercises aggressively with the goal of making moderate improvements.

1 All-in Exercise:
Go all in on one exercise to make as much progress as possible.


Next I set out to build a performance curve for each exercise using the distributions from the events.  I defined an improvement function to calculate how I could expect to reduce my time on any event. This calculation took into account my current performance, the upper bound of performance on this movement, and the level of focus I applied. The improvement rates were based on data from previous training cycles, and are my little secret for now…

I applied these calculations to each event and produced estimates of my expected time reduction per event by focus level.
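To make the shape of such an improvement function concrete, here is a minimal sketch. It is not the function used in this post: the actual improvement rates are not published, so the `FOCUS_RATES` values below are invented placeholders, and the linear "close a fraction of the remaining gap" form is an assumption chosen because it shows diminishing returns simply.

```python
import math

# Hypothetical fraction of the remaining gap closed per training cycle.
# These are placeholders, NOT the rates used in the post.
FOCUS_RATES = {"maintenance": 0.02, "priority": 0.10, "all_in": 0.25}


def expected_saving(current_s: float, best_s: float, focus: str) -> float:
    """Estimated seconds saved on one event.

    current_s: current time on the event, in seconds
    best_s:    assumed upper bound of performance (fastest realistic time)
    focus:     one of the keys in FOCUS_RATES
    """
    gap = max(current_s - best_s, 0.0)  # no room left once at the bound
    return FOCUS_RATES[focus] * gap


# The same focus level saves less as the athlete nears the bound.
far = expected_saving(300, 200, "priority")
near = expected_saving(220, 200, "priority")
print(far > near)  # True
assert math.isclose(far, 10.0)
```

Because the saving scales with the remaining gap, the closer the current time is to the bound, the less any level of focus buys.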

Once I had these data points for all 9 movements, I wrote a script to determine all unique combinations of the events across the three buckets, and the estimated time savings for each. In total there were 504 unique combinations. When it was all said and done, the program spit out the combination that should lead to the greatest possible improvement in my total time.
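The count of 504 follows from choosing 1 all-in event out of 9, then 3 priority events out of the remaining 8 (9 × 56 = 504), with the last 5 falling into maintenance. A minimal sketch of that search is below. It is a reconstruction rather than the original script, and the `savings` table is filled with placeholder numbers only so the example runs.

```python
from itertools import combinations

EVENTS = [
    "Running", "Ski Erg", "Sled Push", "Sled Pull", "Burpee Broad Jump",
    "Row", "Farmers Carry", "Sandbag Lunges", "Wall Balls",
]


def allocations(events, n_priority=3):
    """Yield every (all_in, priority, maintenance) split of the events."""
    for all_in in events:
        rest = [e for e in events if e != all_in]
        for priority in combinations(rest, n_priority):
            maintenance = tuple(e for e in rest if e not in priority)
            yield all_in, priority, maintenance


def total_saving(allocation, savings):
    """Sum the estimated seconds saved for one allocation."""
    all_in, priority, maintenance = allocation
    return (
        savings[all_in]["all_in"]
        + sum(savings[e]["priority"] for e in priority)
        + sum(savings[e]["maintenance"] for e in maintenance)
    )


def best_allocation(events, savings):
    """Return the allocation with the largest total estimated saving."""
    return max(allocations(events), key=lambda a: total_saving(a, savings))


# Placeholder savings per event and focus level (seconds); not real estimates.
savings = {e: {"maintenance": 1, "priority": 10, "all_in": 20} for e in EVENTS}
savings["Running"]["all_in"] = 200

print(sum(1 for _ in allocations(EVENTS)))   # 504
print(best_allocation(EVENTS, savings)[0])   # Running
```

With only 504 candidates, brute force is plenty; there is no need for a cleverer optimizer.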

In case that makes no sense, here are the results cleaned up:

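| Priority | Event | Seconds Saved |
| --- | --- | --- |
| Maintenance | Ski Erg | 1 |
| Maintenance | Sled Push | 2 |
| Maintenance | Row | 0 |
| Maintenance | Farmers Carry | 3 |
| Maintenance | Sandbag Lunges | 6 |
| Priority | Sled Pull | 18 |
| Priority | Burpee Broad Jump | 24 |
| Priority | Wall balls | 26 |
| All-in | Running | 210 |
| TOTAL | | 290 |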

The Next Steps

Now that I have a clear list of what to focus on, all I have left to do is train, right?  I could do that, but I think that with this much time before the event there is more I could figure out to optimize my performance further. 

Over the next few weeks I plan on testing some of my assumptions and the accuracy of the model I designed above.  As part of that process I am going to experiment with different training approaches across the events in an attempt to determine which is truly optimal.  I will make a follow-up post in about two months to report on that process and the findings.

The last thing I will do as the race gets closer is explore race pacing.  Looking back on my splits and heart rate data, I see that I went way too hard on events that mattered very little (Ski Erg, row) and not hard enough on events that did matter (all running). The final post in this series will explore race pacing and the optimal approach.  I will make sure I get it up before the event, in case any readers out there are competing as well.  So stay tuned.

Conclusion

The question I set out to answer with this post was: “How to use data science to win a fitness competition?”  Based on my findings above, do I think I can go on to win the Hyrox event?  Nope.  At most I’ll only be able to shave off about 6 minutes, but if everything is accurate, that could put me in the top 10, and that’s still pretty exciting for me.