Principal Component Analysis (PCA)
What is PCA?
PCA (Principal Component Analysis) is a process of transforming quantitative, potentially correlated features into a set of orthogonal, uncorrelated principal components. This technique is often used for dimensionality reduction: by transforming variables into principal components, the maximum amount of data variance can be retained while reducing the number of features in the data. The process begins with the standardization of the data – every feature’s mean becomes 0 and its variance becomes 1 – which ensures that no single feature influences the principal components (PCs) simply because of its scale. The covariance matrix of the standardized data is then computed to capture the relationships between features. Next, the eigenvalues and eigenvectors of this matrix are calculated to identify the PCs that capture the most variance in the original features. Based on the eigenvalues, the top PCs are selected to reduce the data to the required number of dimensions. Finally, the original dataset is projected onto the new lower-dimensional space. Through this process, data scientists are able to reduce the complexity of data while maximizing information retention.
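The steps above map directly onto code. Below is a minimal NumPy sketch of the standardize / covariance / eigendecomposition / projection pipeline, written for illustration rather than taken from this project's actual code:

    import numpy as np

    def pca(X, n_components):
        # Standardize: every feature gets mean 0 and variance 1
        X_std = (X - X.mean(axis=0)) / X.std(axis=0)
        # Covariance matrix of the standardized features
        cov = np.cov(X_std, rowvar=False)
        # Eigendecomposition (eigh suits symmetric matrices like a covariance matrix)
        eigvals, eigvecs = np.linalg.eigh(cov)
        # Sort by descending eigenvalue and keep the top n_components eigenvectors
        order = np.argsort(eigvals)[::-1][:n_components]
        components = eigvecs[:, order]
        # Project the standardized data onto the lower-dimensional space,
        # and report each kept PC's share of the total variance
        return X_std @ components, eigvals[order] / eigvals.sum()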
Snippet of Data to Apply PCA to
Snippet after Standardization
My PCA Data
Below, PCA will be applied to the quantitative features from the ultra-running dataset. The label (which is removed before PCA) is finishing hours. The remaining data is 6-dimensional (see screenshot to the right). This transformation will be very helpful going forward, as performing further analysis on data with reduced dimensionality should improve both performance and interpretability.
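As a sketch of this setup (the file name and label column name here are hypothetical placeholders, not necessarily the ones used in the project):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("ultra_running.csv")         # hypothetical file name
    y = df["finishing_hours"]                     # label, removed before PCA (hypothetical column name)
    X = df.drop(columns=["finishing_hours"])      # the remaining 6 quantitative features
    X_std = StandardScaler().fit_transform(X)     # every feature: mean 0, variance 1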
Graphical Representation of the data projected into 2 dimensions
PCA with 2 Components
To the left is the graphical representation of the dataset transformed into two principal components. The visualization is color-coded by the value of the previously removed label (finishing hours). It is clear that the PCs have retained valuable information, as points with similar label values appear clustered together.
As the code output below shows, the newly transformed two-dimensional data captures approximately 54% of the total variance in the dataset. The first PC captures 32.7% and the second 21.6%. These two PCs are the most important, as they capture the most variance; each successive PC captures less and less.
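A sketch of how such a 2-component transformation, variance report, and color-coded scatter plot can be produced with scikit-learn, assuming the X_std and y from the earlier sketch:

    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt

    pca2 = PCA(n_components=2)
    X_2d = pca2.fit_transform(X_std)
    print(pca2.explained_variance_ratio_)        # two ratios, summing to roughly 0.54 here

    # Scatter the projected data, color-coded by the removed label
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=10)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.colorbar(label="Finishing hours")
    plt.show()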
PCA with 3 Components
To the right is the graphical representation of the dataset transformed into three principal components. The visualization is color-coded by the value of the previously removed label.
As the code output below shows, the newly transformed three-dimensional data captures approximately 71% of the total variance in the dataset: the first PC captures 32.7%, the second 21.6%, and the third 17%. As with 2 PCs, these visualized 3 PCs are the three most valuable, maximizing the variance explained by three components.
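The 3-component version follows the same pattern; a minimal sketch, again assuming X_std, y, and the imports from the earlier sketches:

    pca3 = PCA(n_components=3)
    X_3d = pca3.fit_transform(X_std)
    print(pca3.explained_variance_ratio_)        # three ratios; their sum is the total variance retained

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")        # 3D axes (matplotlib >= 3.2)
    ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=y, s=10)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_zlabel("PC3")
    plt.show()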
Graphical Representation of the data projected into 3 dimensions
In Order to Retain ~95% of the Variability, ~5 PCs Are Required
Five PCs account for over 94% of the total variance exhibited in the data. Six components will obviously cover 100% of the variance, given that the data is six-dimensional. Therefore, in order to capture ~95% of the variability, the dimensionality of the data is only reduced by one.
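One way to check how many PCs a variance target requires is to fit PCA with all components and inspect the cumulative explained variance; a sketch, assuming the X_std from earlier:

    import numpy as np
    from sklearn.decomposition import PCA

    pca_full = PCA().fit(X_std)                      # keep all 6 components
    cum_var = np.cumsum(pca_full.explained_variance_ratio_)
    print(cum_var)                                   # cumulative variance after 1, 2, ..., 6 PCs
    n_for_95 = int(np.argmax(cum_var >= 0.95)) + 1   # smallest count reaching 95% (6 always reaches 100%)
    print(n_for_95)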
When moving forward with further analysis, it will be very useful to use the transformed data with 2 or 3 principal components. This transformation retains much of the data’s relevant information and variance while greatly reducing its dimensionality and complexity, enabling improved future analysis of the ultra-running race results.