
Data Prep / EDA
The Journey of Gathering, Cleaning, Preparing, and Exploring Ultrarunning Data
Gathering Data
The primary source of data is a large dataset containing over 7 million records of ultramarathon race results. This data is accessed through: https://www.kaggle.com/datasets/aiaiaidavid/the-big-dataset-of-ultra-marathon-running
Original dataset: 7 million rows of uncleaned CSV data
The second source of data gathered was historical weather data corresponding to the weather during specific races from the above dataset. This data was gathered using WeatherStack API:
Website: https://weatherstack.com/
Core Endpoint: https://api.weatherstack.com/historical
URL GET example: https://api.weatherstack.com/historical?access_key=2d41f47d58a932107f900dedc410321a&query=Jackson%20WY&historical_date=2015-08-19
Snippet of pulled API weather data converted to CSV
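A GET like the one above can be constructed with the Python standard library alone. This is a minimal sketch; the function name is my own, and the access key shown is a placeholder, not a real credential:

```python
from urllib.parse import urlencode

BASE = "https://api.weatherstack.com/historical"

def historical_url(access_key, location, race_date):
    """Build a WeatherStack historical GET URL for one race location/date."""
    params = {
        "access_key": access_key,      # placeholder: substitute your own key
        "query": location,             # e.g. "Jackson WY"
        "historical_date": race_date,  # ISO format: YYYY-MM-DD
    }
    return f"{BASE}?{urlencode(params)}"

url = historical_url("YOUR_KEY", "Jackson WY", "2015-08-19")
```

The resulting string can then be passed to any HTTP client (e.g. the `requests` library) to retrieve the JSON weather record.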
Cleaning Example: Distance Variable
The variable came with a variety of issues:
Non-numeric values
Races in Kilometers and in Miles
Unrealistically large values
Non-distance values
Additional string characters
Began with 2,159 unique race distances
Cleaning:
Converted column to string
Removed erroneous characters
Converted miles to kilometers
Replaced illogical distances with logical distances where applicable (or deleted the row)
Converted to float
Created a new column with rounded distances to reduce the number of different race distances (e.g. 49.5 vs. 50)
Created a discretized variable for future analysis
Cleaning resulted in 90 unique race distances (non-discretized)
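The cleaning steps above can be sketched as a per-value function. This is an illustration, not the project's actual code: the unit detection, the 1,000 km cutoff for "unrealistically large" values, and the rounding choice are all assumptions on my part:

```python
import re

MILES_PER_KM = 0.621371
MAX_KM = 1000.0  # assumed cutoff; the real threshold isn't stated above

def clean_distance(raw):
    """Return a race distance in kilometers, or None if the value is unusable."""
    s = str(raw).strip().lower()          # convert to string
    is_miles = "mi" in s and "km" not in s
    num = re.sub(r"[^0-9.]", "", s)       # strip erroneous characters
    if not num:
        return None                       # non-numeric / non-distance value
    try:
        km = float(num)                   # convert to float
    except ValueError:
        return None
    if is_miles:
        km = km / MILES_PER_KM            # convert miles to kilometers
    if km <= 0 or km > MAX_KM:
        return None                       # illogical distance: drop the row
    return round(km)                      # round to collapse near-duplicates
```

Applied across the column, a function like this collapses thousands of raw string variants into a small set of numeric distances.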
Image shows examples of pre-cleaning entries from the event distance column.
Note: The screenshot shows a section of an array of the variable; screenshots of the entire dataset are too illegible given its size.
Example of cleaned race distances
A Similar Cleaning Process Was Performed on Other Variables - See the Linked Code for the Full Cleaning Process
Brief Overview of Additional Data Prep
Including: Variable Creation, Data Frame Subsets, API Data Integration
API Integration
Data prep to avoid calling 7,000,000 GETs included:
The creation of a dataframe subset containing unique 100-mile race events: 200 such events with over 27,000 corresponding records
Date variable manipulation to align with API formatting requirements
Independent search for the location of each unique event
After the API GETs were called, this data then had to be converted from JSON to CSV and integrated into the rest of the dataset
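The JSON-to-CSV step can be sketched with the standard library. The payload shape and field names below (`historical`, `mintemp`, `maxtemp`, `avgtemp`) are assumptions modeled on a WeatherStack-style response, not a guaranteed schema:

```python
import csv
import io
import json

# Illustrative payload shaped like a WeatherStack historical response
sample = json.loads("""
{
  "location": {"name": "Jackson", "region": "Wyoming"},
  "historical": {
    "2015-08-19": {"mintemp": 6, "maxtemp": 24, "avgtemp": 15}
  }
}
""")

def to_rows(payload):
    """Flatten one API payload into CSV-ready dicts (one row per date)."""
    loc = payload["location"]["name"]
    for race_date, day in payload["historical"].items():
        yield {"location": loc, "date": race_date,
               "mintemp": day.get("mintemp"),
               "maxtemp": day.get("maxtemp"),
               "avgtemp": day.get("avgtemp")}

# Write the flattened rows as CSV (here to an in-memory buffer)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["location", "date", "mintemp",
                                         "maxtemp", "avgtemp"])
writer.writeheader()
writer.writerows(to_rows(sample))
```

Once each response is a flat row keyed by location and date, the weather columns can be merged onto the race records by matching on those two fields.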
Data Frame Subset Examples
Much of the future analysis is interested in differences in race performance based on gender, age, and race day weather conditions:
New dataframes were created consisting of just the 1st place Male and Female finishers for every unique event
Dataframes consisting of the average speed per discretized age group per discretized race distance
Dataframes were created to isolate if weather conditions impacted finishing times
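As an illustration of the winners subset, the "first-place finisher per event and gender" logic reduces to a grouped minimum. The record layout and column names here are hypothetical:

```python
# Hypothetical finisher records (column names are assumptions)
results = [
    {"event": "Bighorn 100", "gender": "M", "time_h": 22.5},
    {"event": "Bighorn 100", "gender": "M", "time_h": 24.1},
    {"event": "Bighorn 100", "gender": "F", "time_h": 23.0},
    {"event": "Wasatch 100", "gender": "F", "time_h": 26.4},
]

def first_place(records):
    """Keep the fastest finisher per (event, gender) pair."""
    winners = {}
    for r in records:
        key = (r["event"], r["gender"])
        if key not in winners or r["time_h"] < winners[key]["time_h"]:
            winners[key] = r
    return winners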
Variable Creation Examples
Difference in average speed between male and female race winners
Athlete Age on race day
Discretized Age Groups
Race finishing time converted to hours
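The created variables above can each be sketched as a small helper. These are illustrative functions under assumed inputs (birth year rather than full birth date, a 10-year bin width, and an H:MM:SS time string):

```python
from datetime import date

def age_on_race_day(birth_year, race_date):
    """Approximate athlete age on race day (assumes only birth year is known)."""
    return race_date.year - birth_year

def age_group(age, width=10):
    """Discretize age into fixed-width bins, e.g. 34 -> '30-39'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def finish_time_to_hours(hms):
    """Convert an 'H:MM:SS' finishing-time string to decimal hours."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h + m / 60 + s / 3600
```

The speed-gap variable (difference in average speed between male and female winners) then follows by subtracting the two winners' speeds for each event.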
See the Introduction tab for the exploratory data visualizations, an informative and crucial part of this process (placed there per syllabus instruction)
See the linked code for further insight into this data cleaning and preparation process