Data Prep / EDA

The Journey of Gathering, Cleaning, Prepping, and Exploring Ultrarunning Data

Gathering Data

The primary data source is a large dataset containing over 7 million records of ultramarathon race results, accessed through Kaggle: https://www.kaggle.com/datasets/aiaiaidavid/the-big-dataset-of-ultra-marathon-running

Original dataset: 7 million rows of unclean CSV

The second data source was historical weather data for specific races in the above dataset, gathered using the WeatherStack API:

  • Website: https://weatherstack.com/

  • Core Endpoint: https://api.weatherstack.com/historical

  • URL GET example (access key redacted): https://api.weatherstack.com/historical?access_key=YOUR_ACCESS_KEY&query=Jackson%20WY&historical_date=2015-08-19
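A GET like the one above can be built programmatically for each race. The sketch below only constructs the request URL (no network call); the endpoint and the `access_key`, `query`, and `historical_date` parameters come from the example above, while the helper function name and placeholder key are assumptions:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.weatherstack.com/historical"

def build_historical_url(access_key, location, date):
    """Build the WeatherStack historical GET URL for one race."""
    params = {
        "access_key": access_key,       # placeholder -- use your own key
        "query": location,              # e.g. "Jackson WY"
        "historical_date": date,        # API expects YYYY-MM-DD
    }
    return f"{BASE_URL}?{urlencode(params)}"

url = build_historical_url("YOUR_ACCESS_KEY", "Jackson WY", "2015-08-19")
# The actual fetch would then be e.g. requests.get(url).json()
```

In practice one such URL was built per unique event/date pair rather than per runner (see the API Integration section below).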

Snippet of pulled API weather data converted to CSV

Cleaning Example: Distance Variable

The variable came with a variety of issues:

  • Non-numeric values

  • Races in Kilometers and in Miles

  • Unrealistically large values

  • Non-distance values

  • Additional string characters

  • Began with 2,159 unique race distances

Cleaning:

  • Converted column to string

  • Removed erroneous characters

  • Converted miles to kilometers

  • Replaced illogical distances with logical distances where applicable (or deleted the row)

  • Converted to float

  • Created a new column with rounded distances to reduce the number of different race distances (e.g., 49.5 vs. 50)

  • Created a discretized variable for future analysis

  • Cleaning resulted in 90 unique race distances (non-discretized)
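The cleaning steps above can be sketched in pandas. This is a minimal illustration on a made-up four-row sample; the column name `event_distance`, the bin edges, and the sample values are assumptions, not the project's actual code:

```python
import pandas as pd

# Made-up sample; the real dataset has ~7 million rows.
df = pd.DataFrame({"event_distance": ["50km", "100 mi", "49.5km", "XC"]})

# Convert the column to string and normalize.
s = df["event_distance"].astype(str).str.lower().str.strip()

# Flag mile-based races before stripping the unit characters.
is_miles = s.str.contains("mi", regex=False)

# Remove erroneous characters, then convert to float.
km = pd.to_numeric(s.str.replace(r"[^0-9.]", "", regex=True),
                   errors="coerce")

# Convert miles to kilometers.
km = km.where(~is_miles, km * 1.60934)

df["distance_km"] = km
df["distance_km_rounded"] = km.round()        # e.g. 49.5 -> 50

# Discretized distance variable for later analysis (bins assumed).
df["distance_bin"] = pd.cut(km, bins=[0, 60, 110, 180],
                            labels=["~50 km", "~100 km", "~100 mi"])

# Non-distance rows (e.g. "XC") became NaN and are dropped.
df = df.dropna(subset=["distance_km"])
```

The real pipeline also replaced illogical distances with logical ones where applicable before dropping rows; that lookup is omitted here.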

Image shows examples of pre-cleaned entries from the event distance column.

Note: The screenshot shows a section of an array of the variable; screenshots of the entire dataset would be illegible given its size.

Example of cleaned race distances

Similar Cleaning Process Performed on Other Variables - See Linked Code for Full Cleaning Process

Brief Overview of Additional Data Prep

Including: Variable Creation, Data Frame Subsets, API Data Integration

API Integration

Data prep to avoid calling 7,000,000 GET requests included:

  • The creation of a dataframe subset containing unique 100-mile race events: 200 such events with over 27,000 corresponding records

  • Date variable manipulation to align with API formatting requirements

  • Independent search for the location of each unique event

After the API GETs were called, the returned data had to be converted from JSON to CSV and integrated into the rest of the dataset.
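The subset-and-reformat prep can be sketched as follows; the column names (`event`, `date`, `distance_km`), the source date format, and the toy rows are assumptions for illustration:

```python
import pandas as pd

# Toy stand-in for the cleaned race results.
races = pd.DataFrame({
    "event": ["Leadville 100", "Leadville 100", "Wasatch 100"],
    "date": ["17.08.2015", "17.08.2015", "11.09.2015"],
    "distance_km": [160.9, 160.9, 160.9],
})

# Keep only 100-mile races (~161 km), then one row per unique
# event/date pair, so each event triggers a single GET rather
# than one GET per runner record.
hundreds = races[races["distance_km"].round() == 161]
unique_events = hundreds.drop_duplicates(subset=["event", "date"]).copy()

# Reformat dates to the YYYY-MM-DD form the API expects
# (the dd.mm.yyyy source format is an assumption).
unique_events["historical_date"] = pd.to_datetime(
    unique_events["date"], format="%d.%m.%Y"
).dt.strftime("%Y-%m-%d")
```

After fetching, each JSON response can be flattened (e.g. with `pd.json_normalize`) before being written to CSV and merged back on the event/date keys.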

Data Frame Subset Examples

Much of the future analysis focuses on differences in race performance based on gender, age, and race-day weather conditions:

  • New dataframes were created consisting of just the first-place male and female finishers for every unique event

  • Dataframes were created consisting of the average speed per discretized age group per discretized race distance

  • Dataframes were created to isolate whether weather conditions impacted finishing times
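The first two subsets above can be sketched like this; all column names, the toy rows, and the assumption that `rank` is the within-gender finishing position are illustrative, not the project's actual schema:

```python
import pandas as pd

# Toy stand-in for the cleaned results.
df = pd.DataFrame({
    "event": ["A", "A", "A", "B", "B"],
    "gender": ["M", "F", "M", "M", "F"],
    "rank": [1, 1, 2, 1, 1],          # assumed within-gender rank
    "speed_kmh": [10.0, 9.0, 8.0, 11.0, 9.5],
    "age_group": ["20-29", "30-39", "20-29", "30-39", "20-29"],
    "distance_bin": ["100k", "100k", "100k", "100mi", "100mi"],
})

# First-place male and female finishers for every unique event.
winners = df[df["rank"] == 1]

# Average speed per discretized age group per discretized distance.
avg_speed = (
    df.groupby(["distance_bin", "age_group"])["speed_kmh"]
      .mean()
      .reset_index()
)
```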

Variable Creation Examples

  • Difference in average speed between male and female race winners

  • Athlete Age on race day

  • Discretized Age Groups

  • Race finishing time converted to hours
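The derived variables above can be sketched in pandas; the column names, age-group bin edges, and two-row sample are assumptions for illustration:

```python
import pandas as pd

# Toy stand-in: one event's male and female winners.
df = pd.DataFrame({
    "event": ["A", "A"],
    "gender": ["M", "F"],
    "winner_speed_kmh": [10.0, 9.0],
    "birth_year": [1985, 1990],
    "race_date": pd.to_datetime(["2015-08-17", "2015-08-17"]),
    "finish_time": pd.to_timedelta(["22:30:00", "25:15:00"]),
})

# Athlete age on race day (by year; exact birthdates assumed absent).
df["age_on_race_day"] = df["race_date"].dt.year - df["birth_year"]

# Discretized age groups (bin edges assumed).
df["age_group"] = pd.cut(df["age_on_race_day"],
                         bins=[0, 29, 39, 49, 120],
                         labels=["<30", "30-39", "40-49", "50+"])

# Race finishing time converted to hours.
df["finish_hours"] = df["finish_time"].dt.total_seconds() / 3600

# Difference in average speed between male and female winners,
# per event: pivot to M/F columns, then subtract.
speed_gap = (df.pivot_table(index="event", columns="gender",
                            values="winner_speed_kmh")
               .eval("M - F"))
```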

See the Introduction tab for exploratory data visualizations, which are an informative and crucial part of this process.

(Visualizations are in the Introduction tab per Syllabus Instruction)

See the linked code for further insight into this data cleaning and preparation process.