Project: Regression and marathon results

Project weight: 10 points

Download the following CSV file, which contains race results of about 26,000 marathon runners:

Objectives

Analyze the marathon data. In particular:

  1. The ‘naive’ way to predict finish times of the runners based on their 5K times is to multiply the 5K time by the ratio between the full marathon distance and the 5K distance. Compute such predictions and evaluate their accuracy.

  2. Use linear regression to predict finish times of runners based on their 5K times. Compare accuracy of these predictions with the accuracy of the ‘naive’ predictions.

  3. Use other data besides the 5K time (the age of a runner, whether the runner was male or female) together with the 5K times to predict finish times, and check whether this meaningfully improves the predictions.

  4. Use logistic regression to predict whether a runner was male or female based on their 5K time (or some other time: 10K, finish time, etc.). Do the results improve meaningfully if the logistic regression uses age as well?

  5. Add anything else that you find relevant and interesting.
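
As an illustration of the ‘naive’ prediction in objective 1, here is a minimal sketch in plain Python. It assumes times are measured in seconds and takes the marathon distance to be 42.195 km; the three runners below are made-up values for illustration, not rows from the race data, and the helper names are hypothetical. Mean absolute error is only one possible accuracy measure.

```python
# Distances in kilometres; 42.195 km is the standard marathon distance.
MARATHON_KM = 42.195
FIVE_K_KM = 5.0


def naive_finish_time(five_k_seconds):
    """Scale a 5K split to the full distance, assuming constant pace."""
    return five_k_seconds * (MARATHON_KM / FIVE_K_KM)


def mean_absolute_error(predicted, actual):
    """Average size of the prediction error, in the same units as the inputs."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Made-up example values (seconds), NOT taken from the race results file.
five_k = [1200.0, 1500.0, 1800.0]
finish = [10500.0, 13400.0, 16900.0]

predictions = [naive_finish_time(t) for t in five_k]
mae = mean_absolute_error(predictions, finish)
print(f"naive MAE: {mae:.1f} seconds")
```

With the real data, the same two functions would be applied to the 5K and finish-time columns of the CSV file, whatever those columns turn out to be called.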

Note: You can use sklearn to compute linear and logistic regressions.
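
The following is a rough sketch of how objectives 2 to 4 might be set up with sklearn. Since the column names of the CSV file are not given here, the code runs on synthetic data generated with NumPy; the variables `five_k`, `finish`, `age`, and `is_male`, the coefficients used to generate them, and the helper functions are all assumptions for illustration only. The train/test split and the choice of mean absolute error and accuracy as metrics are likewise just one reasonable option.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the race data (times in seconds). These numbers are
# invented for illustration and say nothing about the real results.
rng = np.random.default_rng(0)
n = 2000
is_male = rng.integers(0, 2, size=n)
age = rng.uniform(18, 70, size=n)
five_k = 1700 - 150 * is_male + 3 * (age - 18) + rng.normal(0, 200, size=n)
finish = 9.0 * five_k + 5 * (age - 18) + rng.normal(0, 600, size=n)

# Feature matrices: 5K time only, 5K time + age, and 5K time + age + sex.
X_time = five_k.reshape(-1, 1)
X_time_age = np.column_stack([five_k, age])
X_full = np.column_stack([five_k, age, is_male])

# Hold out a quarter of the runners so accuracy is measured on unseen data.
idx_train, idx_test = train_test_split(
    np.arange(n), test_size=0.25, random_state=0
)


def linear_mae(X, y):
    """Fit a linear regression on the training rows; return the test MAE."""
    model = LinearRegression().fit(X[idx_train], y[idx_train])
    return mean_absolute_error(y[idx_test], model.predict(X[idx_test]))


def logistic_accuracy(X, y):
    """Fit a logistic regression on scaled features; return test accuracy."""
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X[idx_train], y[idx_train])
    return accuracy_score(y[idx_test], model.predict(X[idx_test]))


mae_time = linear_mae(X_time, finish)
mae_full = linear_mae(X_full, finish)
acc_time = logistic_accuracy(X_time, is_male)
acc_time_age = logistic_accuracy(X_time_age, is_male)

print(f"linear regression MAE, 5K only:      {mae_time:.1f} s")
print(f"linear regression MAE, 5K/age/sex:   {mae_full:.1f} s")
print(f"logistic regression accuracy, 5K:     {acc_time:.3f}")
print(f"logistic regression accuracy, 5K/age: {acc_time_age:.3f}")
```

Scaling the features before the logistic regression is a design choice made here because times in seconds are large numbers, which can slow down the default solver; it is not required by the assignment.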