
Data Prep / EDA
The Journey of Gathering, Cleaning, Preparing, and Exploring Ultrarunning Data
Gathering Data
The primary source of data is a large dataset containing over 7 million records of ultramarathon race results. This data is accessed through: https://www.kaggle.com/datasets/aiaiaidavid/the-big-dataset-of-ultra-marathon-running
Original dataset: 7 million rows of uncleaned CSV data
The second source of data gathered was historical weather data corresponding to the weather during specific races from the above dataset. This data was gathered using WeatherStack API:
Website: https://weatherstack.com/
Core Endpoint: https://api.weatherstack.com/historical
URL GET example: https://api.weatherstack.com/historical?access_key=2d41f47d58a932107f900dedc410321a&query=Jackson%20WY&historical_date=2015-08-19
Snippet of pulled API weather data converted to CSV
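A GET like the one above can be constructed with the Python standard library alone. This is a minimal sketch; the function name is my own, and the access key shown is a placeholder, not a real credential:

```python
from urllib.parse import urlencode

BASE = "https://api.weatherstack.com/historical"

def historical_url(access_key, location, race_date):
    """Build a WeatherStack historical GET URL for one race location/date."""
    params = {
        "access_key": access_key,      # placeholder: substitute your own key
        "query": location,             # e.g. "Jackson WY"
        "historical_date": race_date,  # ISO format: YYYY-MM-DD
    }
    return f"{BASE}?{urlencode(params)}"

url = historical_url("YOUR_KEY", "Jackson WY", "2015-08-19")
```

The resulting string can then be passed to any HTTP client (e.g. the `requests` library) to retrieve the JSON weather record.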
Cleaning Example: Distance Variable
The variable came with a variety of issues:
Non-numeric values
Races in Kilometers and in Miles
Unrealistically large values
Non-distance values
Additional string characters
Began with 2,159 unique race distances
Cleaning:
Converted column to string
Removed erroneous characters
Converted miles to kilometers
Replaced illogical distances with logical distances where applicable (or deleted the row)
Converted to float
Created a new column with rounded distances to reduce the number of different race distances (e.g. 49.5 vs. 50)
Created a discretized variable for future analysis
Cleaning resulted in 90 unique race distances (non-discretized)
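The cleaning steps above can be sketched as a per-value function. This is an illustration, not the project's actual code: the unit detection, the 1,000 km cutoff for "unrealistically large" values, and the rounding choice are all assumptions on my part:

```python
import re

MILES_PER_KM = 0.621371
MAX_KM = 1000.0  # assumed cutoff; the real threshold isn't stated above

def clean_distance(raw):
    """Return a race distance in kilometers, or None if the value is unusable."""
    s = str(raw).strip().lower()          # convert to string
    is_miles = "mi" in s and "km" not in s
    num = re.sub(r"[^0-9.]", "", s)       # strip erroneous characters
    if not num:
        return None                       # non-numeric / non-distance value
    try:
        km = float(num)                   # convert to float
    except ValueError:
        return None
    if is_miles:
        km = km / MILES_PER_KM            # convert miles to kilometers
    if km <= 0 or km > MAX_KM:
        return None                       # illogical distance: drop the row
    return round(km)                      # round to collapse near-duplicates
```

Applied across the column, a function like this collapses thousands of raw string variants into a small set of numeric distances.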
Image shows examples of pre-cleaning entries from the event distance column.
Note: The screenshot shows a section of an array of the variable; screenshots of the entire dataset are too illegible given its size.
Example of cleaned race distances
A Similar Cleaning Process Was Performed on Other Variables - See the Linked Code for the Full Cleaning Process
Brief Overview of Additional Data Prep
Including: Variable Creation, Data Frame Subsets, API Data Integration
API Integration
Data prep to avoid calling 7,000,000 GETs included:
The creation of a dataframe subset containing unique 100-mile race events: 200 such events with over 27,000 corresponding records
Date variable manipulation to align with API formatting requirements
Independent search for the location of each unique event
After the API GETs were called, this data then had to be converted from JSON to CSV and integrated into the rest of the dataset
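The JSON-to-CSV step can be sketched with the standard library. The payload shape and field names below (`historical`, `mintemp`, `maxtemp`, `avgtemp`) are assumptions modeled on a WeatherStack-style response, not a guaranteed schema:

```python
import csv
import io
import json

# Illustrative payload shaped like a WeatherStack historical response
sample = json.loads("""
{
  "location": {"name": "Jackson", "region": "Wyoming"},
  "historical": {
    "2015-08-19": {"mintemp": 6, "maxtemp": 24, "avgtemp": 15}
  }
}
""")

def to_rows(payload):
    """Flatten one API payload into CSV-ready dicts (one row per date)."""
    loc = payload["location"]["name"]
    for race_date, day in payload["historical"].items():
        yield {"location": loc, "date": race_date,
               "mintemp": day.get("mintemp"),
               "maxtemp": day.get("maxtemp"),
               "avgtemp": day.get("avgtemp")}

# Write the flattened rows as CSV (here to an in-memory buffer)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["location", "date", "mintemp",
                                         "maxtemp", "avgtemp"])
writer.writeheader()
writer.writerows(to_rows(sample))
```

Once each response is a flat row keyed by location and date, the weather columns can be merged onto the race records by matching on those two fields.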
Data Frame Subset Examples
Much of the future analysis is interested in differences in race performance based on gender, age, and race day weather conditions:
New dataframes were created consisting of just the 1st place Male and Female finishers for every unique event
Dataframes consisting of the average speed per discretized age group per discretized race distance
Dataframes were created to isolate if weather conditions impacted finishing times
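As an illustration of the winners subset, the "first-place finisher per event and gender" logic reduces to a grouped minimum. The record layout and column names here are hypothetical:

```python
# Hypothetical finisher records (column names are assumptions)
results = [
    {"event": "Bighorn 100", "gender": "M", "time_h": 22.5},
    {"event": "Bighorn 100", "gender": "M", "time_h": 24.1},
    {"event": "Bighorn 100", "gender": "F", "time_h": 23.0},
    {"event": "Wasatch 100", "gender": "F", "time_h": 26.4},
]

def first_place(records):
    """Keep the fastest finisher per (event, gender) pair."""
    winners = {}
    for r in records:
        key = (r["event"], r["gender"])
        if key not in winners or r["time_h"] < winners[key]["time_h"]:
            winners[key] = r
    return winners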
Variable Creation Examples
Difference in average speed between male and female race winners
Athlete Age on race day
Discretized Age Groups
Race finishing time converted to hours
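The created variables above can each be sketched as a small helper. These are illustrative functions under assumed inputs (birth year rather than full birth date, a 10-year bin width, and an H:MM:SS time string):

```python
from datetime import date

def age_on_race_day(birth_year, race_date):
    """Approximate athlete age on race day (assumes only birth year is known)."""
    return race_date.year - birth_year

def age_group(age, width=10):
    """Discretize age into fixed-width bins, e.g. 34 -> '30-39'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def finish_time_to_hours(hms):
    """Convert an 'H:MM:SS' finishing-time string to decimal hours."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h + m / 60 + s / 3600
```

The speed-gap variable (difference in average speed between male and female winners) then follows by subtracting the two winners' speeds for each event.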
See the Introduction tab for the exploratory data visualizations, an informative and crucial part of this process (placed there per syllabus instruction)
See the linked code for further insight into this data cleaning and preparation process