Association Rule Mining (ARM)

Overview

Association Rule Mining (ARM) is an unsupervised machine learning technique that aims to discover patterns and associations between items in a dataset. Unlike the PCA and clustering techniques explored in previous tabs, ARM requires transaction data, where each record is a set of items.

ARM extracts rules from the data: patterns of the form “if A, then B” that suggest a strong relationship between items. Rules are commonly evaluated by their support, confidence, and lift. Support is the fraction of all transactions that contain a given itemset. Confidence measures how often item B occurs in transactions that already contain A. Finally, lift measures how much more often A and B occur together than would be expected if they were independent. A lift of 1 corresponds to A and B being independent. A lift greater than 1 means they are positively associated, and the further above 1, the stronger the association. A lift less than 1 means they are negatively associated and appear together less often than independence would predict.
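The three measures can be computed directly from a list of transactions. The sketch below is illustrative Python, not the R code used for this analysis. The toy grocery transactions and the function names `support`, `confidence`, and `lift` are assumptions made for the example.

```python
# Toy transactions (hypothetical, not taken from the ultra-running data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Share of transactions containing `lhs` that also contain `rhs`."""
    both = support(set(lhs) | set(rhs), transactions)
    return both / support(lhs, transactions)


def lift(lhs, rhs, transactions):
    """Confidence of the rule divided by the baseline support of `rhs`."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)


# Evaluate the rule {bread} -> {milk}.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
rule_lift = lift({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence, rule_lift)
```

On this toy data the rule has a support of 0.6, a confidence of 0.75, and a lift of about 0.94, so bread and milk appear together slightly less often than independence would predict.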

In association rule mining, the Apriori algorithm is often used as the basis for identifying frequent itemsets and generating rules. The algorithm applies the Apriori property when combing a dataset for rules: if an itemset is frequent, then all of its subsets must also be frequent. Conversely, if an itemset is infrequent, all of its supersets must be infrequent as well. By exploiting this property, the algorithm does not have to check every candidate itemset and can efficiently eliminate those that cannot meet the minimum support threshold.
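The pruning idea can be sketched in a few lines of Python. This is a minimal, unoptimized illustration of level-wise frequent itemset search, not the implementation used for the analysis on this page. The function name `apriori_frequent_itemsets` and the toy transactions are assumptions made for the example.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting `min_support`."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: keep only the single items that are frequent.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {
            a | b for a, b in combinations(list(current), 2) if len(a | b) == k
        }
        # Prune step (Apriori property): a candidate with any infrequent
        # (k-1)-subset cannot be frequent, so it is dropped without counting.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count support only for the surviving candidates.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Toy transactions (hypothetical).
toy = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]
frequent = apriori_frequent_itemsets(toy, min_support=0.6)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Because "d" is infrequent on its own, no itemset containing "d" is ever generated as a candidate, which is where the savings come from on larger datasets.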

Example of Transaction Data:

Statistical Representation:

Visualization of Apriori Algorithm

Snippet of Discretized Transaction Data (above)

Data Prep

As mentioned in the overview, Association Rule Mining requires unlabeled transaction data. To meet this requirement, significant adjustments had to be applied to the ultra-running dataset, including the removal of column names and the discretization of multiple quantitative and qualitative variables. The ARM analysis is performed on the following features: gender, race distance, athlete age, athlete finishing time, athlete finishing position, race average temperature, and race elevation gain.
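As an illustration of what discretization involves, the Python sketch below turns one record into a list of items by binning its numeric fields. It is not the preparation code used for this project. The bin edges, the field names (`gender`, `finish_hours`, `age`), and the helper names `make_binner` and `record_to_transaction` are all hypothetical.

```python
from bisect import bisect_right


def make_binner(edges, unit):
    """Return a function that maps a number to a range label.

    Bins are half-open: a value equal to an edge falls in the bin above it.
    """
    def label(value):
        i = bisect_right(edges, value)
        if i == 0:
            return f"<{edges[0]} {unit}"
        if i == len(edges):
            return f">={edges[-1]} {unit}"
        return f"{edges[i - 1]}-{edges[i]} {unit}"
    return label


# Hypothetical bin edges, chosen only for this example.
time_bin = make_binner([5, 10, 15, 20, 25, 30], "hours")
age_bin = make_binner([20, 30, 40, 50, 60], "years")


def record_to_transaction(record):
    """Drop the column names and keep one categorical item per field."""
    return [
        record["gender"],
        time_bin(record["finish_hours"]),
        age_bin(record["age"]),
    ]


# A made-up record, not a row from the dataset.
record = {"gender": "Male", "finish_hours": 7.5, "age": 34}
transaction = record_to_transaction(record)
print(transaction)
```

Including the unit in each label keeps items from different columns distinct once the column names are gone, so an age range cannot be confused with a time range.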

Snippet of Non-Discretized Record Data (above)

Link to Un-prepared Data

Link to Prepared Data

ARM Analysis

Association Rule Mining was performed in R with the following parameters: minimum support = 0.05, minimum confidence = 0.8, and minimum length = 2. This produced 370 rules and the following results:

To the right is a table of the 15 rules with the highest lift. Several rules have a strong lift of ~14, followed by several with a lift around 2 and 3. The top rules suggest a very strong positive association between a right-hand side of “5 - 10 hours” and a left-hand side of “<50k”, and vice versa. In the context of the dataset this makes a lot of sense, as “<50k” is the shortest distance category in the data and “5-10” hours is considered a very fast race in ultra-running. Similarly, some of the other high-lift rules contain “151-180km” on the left and “20-25” hours on the right, which both correspond to longer and slower races. It is reasonable that these items are positively associated and very likely to occur together.

To the right is a network of the 15 rules sorted by lift. The size of each circle corresponds to the support of the rule. The plot clearly shows that the two rules with the highest lift also have very low support, which suggests that these items rarely occur in the dataset but, when they do occur, they are likely to occur together.

From the same analysis, the table to the right shows the 15 rules with the highest confidence. This table is less informative than the lift table, as every rule shown has a confidence of 1. Upon further inspection, 19 rules have a confidence of 1. This means that in every transaction containing the left-hand side, the right-hand side also occurs. In the context of the dataset, these rules seem logical. Looking at the first rule, every time a race takes 25 to 30 hours, the race is between 151-180km. This makes sense, as 151-180 km is the longest distance category and 25 to 30 hours is a long time to be competing in a running race.

To the right is a network of the rules with the highest confidence. As before, the size of each circle represents the support of the rule. Here, several rules also have strong support, including the “25 to 30 hours” to “151-180km” rule just discussed. This indicates that the pairing is frequent in the dataset as well as reliable, which is further evidence for the strength of the rule.

Finally, to the right is a table displaying the 15 rules with the highest support. This table includes multiple rules that occur in over 50% of the transactions. Some of the relationships were mentioned previously, such as the relationship between long races and long racing times. Additionally, the highest-support rules are between “151-180km” and “Male”. In other words, over 50% of the transactions are from men who competed in the longest distance category.
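The workflow in this section (mine rules under support, confidence, and length thresholds, then rank them by a chosen measure) can be sketched in Python as follows. This is not the R code behind the tables above. It enumerates itemsets by brute force rather than with Apriori, restricts rules to a single item on the right-hand side, and uses a made-up toy dataset. The name `mine_rules` and its parameters are assumptions made for the example, with `min_len` counting the items on both sides of a rule.

```python
from itertools import combinations


def mine_rules(transactions, min_support, min_confidence, min_len=2):
    """Brute-force rule mining with single-item right-hand sides."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    rules = []
    for size in range(max(min_len, 2), len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = support(itemset)
            # Skip itemsets that never occur or fall below the threshold.
            if s == 0 or s < min_support:
                continue
            # Try each item in turn as the right-hand side.
            for rhs_item in combo:
                rhs = frozenset([rhs_item])
                lhs = itemset - rhs
                conf = s / support(lhs)
                if conf >= min_confidence:
                    rules.append({
                        "lhs": set(lhs),
                        "rhs": set(rhs),
                        "support": s,
                        "confidence": conf,
                        "lift": conf / support(rhs),
                    })
    return rules


# Toy transactions with hypothetical labels, not the ultra-running data.
toy = [
    {"short", "fast"},
    {"short", "fast"},
    {"long", "slow"},
    {"long", "slow"},
    {"long", "slow"},
    {"long", "fast"},
]
rules = mine_rules(toy, min_support=0.3, min_confidence=0.8)

# Rank by lift; swap the key for "confidence" or "support" to re-rank.
top = sorted(rules, key=lambda r: r["lift"], reverse=True)
for r in top:
    print(r["lhs"], "->", r["rhs"], r["support"], r["confidence"], r["lift"])
```

In this toy example the rule with the highest lift is also the one with the lower support, which mirrors the pattern seen in the lift network plot: ranking by different measures can surface quite different rules.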

Conclusions

Link to Code

An in-depth analysis of the rules produced through association rule mining has provided useful information about the ultra-running dataset. There was clear evidence for and against multiple hypotheses. For example, this analysis helped to confirm that shorter races are indeed typically faster. However, evidence was also found contrary to the hypothesis that women are more prevalent and more competitive in longer races. While this process does not provide definitive answers, it is very useful for highlighting relationships that should be examined further.