Apriori Algorithm from Scratch | Building a Netflix Film Recommendation System
Association rule mining is a popular technique for uncovering relationships between items in large datasets. One of the most widely used algorithms in this field is the Apriori algorithm. Imagine using this algorithm to create a Netflix recommendation system that suggests movies based on the viewing habits of users. In this post, we’ll explain how the Apriori algorithm works, why it’s useful, and walk through its implementation in Python. We’ll also discuss its pros and cons, and explore some related algorithms like FP-Growth and Eclat.
Introduction
The Apriori algorithm is a fundamental technique in association rule mining, often used to analyze transaction data and find frequent itemsets. This algorithm works by discovering relationships in data, such as which items are commonly purchased together. In the case of Netflix, we can use Apriori to recommend movies to users based on what others with similar viewing histories have watched.
Apriori algorithm finds frequent itemsets by iteratively scanning through transaction data and pruning infrequent items. Once frequent itemsets are identified, association rules can be generated to predict which items (or movies, in our case) are likely to occur together.
In this blog post, we’ll break down the Apriori algorithm step by step, use a movie recommendation system as an example, and implement the algorithm from scratch. We’ll also explore when and why you might use Apriori, its strengths and weaknesses, and conclude with a look at related algorithms. In a Netflix movie recommendation scenario, Apriori could help identify patterns like “If a user watches ‘The Matrix,’ they are also likely to watch ‘Inception’.”
Mathematical Foundations of Apriori
The Apriori algorithm operates on 3 key metrics: support, confidence, lift. Selecting appropriate values for minimum support, minimum confidence, and lift is crucial in association rule mining to balance finding meaningful rules and managing computational efficiency. Let’s break them down:
\[\text{Support}(X) = \frac{\text{Number of transactions containing } X}{\text{Total number of transactions}}\]Support measures how often an item or itemset appears in the dataset. It helps filter out infrequent items.
\[\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X, Y)}{\text{Support}(X)}\]Confidence measures the likelihood of a consequent item appearing given the antecedent item. This is essential for generating reliable rules.
\[\text{Lift}(X \rightarrow Y) = \frac{\text{Confidence}(X \rightarrow Y)}{\text{Support}(Y)} = \frac{0.75}{0.8} = 0.9375\]Lift measures how much more likely item YY is purchased when item XX is purchased compared to its usual purchase rate. Let’s calculate the lift for $A \rightarrow B$.
These metrics are used to generate association rules, such as recommending movies to Netflix users based on past viewing habits.
Netflix Example: Movie Recommendation System
To illustrate how the Apriori algorithm works, let’s consider an example using five hypothetical Netflix transactions where users watched various movies:
Transaction ID (User ID) | Movies Watched |
---|---|
1 | The Matrix, Inception, Interstellar |
2 | Inception, Interstellar |
3 | The Matrix, Interstellar |
4 | The Matrix, Inception |
5 | Inception, Interstellar |
Our goal is to discover frequent itemsets and generate rules like “If a user watched ‘The Matrix’, they are likely to watch ‘Inception’.”
Step-by-Step Breakdown of the Apriori Algorithm
Step 1: Define Itemsets
Let’s define the following itemsets and rules:
- Items: ${set(The Matrix, Inception, Interstellar)}$
- Itemsets:
- Single items: ${set(The Matrix)}, set({Inception}), set({Interstellar}) $
- Pairs: $set({The Matrix, Inception}), set({The Matrix, Interstellar}), set({Inception, Interstellar})$
Step 2: Calculate Support
Support is the proportion of transactions that contain the itemset. Minimum support filters out infrequent itemsets, with typical values ranging from 0.1 to 0.5, with common choices around 0.2 to 0.3.
Support for Single Items:
\[\text{Support(The Matrix)} = \frac{\text{Number of transactions containing The Matrix}}{\text{Total number of transactions}} = \frac{3}{5} = 0.6\] \[\text{Support(Inception)} = \frac{\text{Number of transactions containing Inception}}{\text{Total number of transactions}} = \frac{4}{5} = 0.8\] \[\text{Support(Interstellar)} = \frac{\text{Number of transactions containing Interstellar}}{\text{Total number of transactions}} = \frac{4}{5} = 0.8\]Support for Item Pairs:
\[\text{Support(The Matrix, Inception)} = \frac{\text{Number of transactions containing The Matrix and Inception}}{\text{Total number of transactions}} = \frac{2}{5} = 0.4\] \[\text{Support(The Matrix, Interstellar)} = \frac{\text{Number of transactions containing The Matrix and Interstellar}}{\text{Total number of transactions}} = \frac{3}{5} = 0.6\] \[\text{Support(Inception, Interstellar)} = \frac{\text{Number of transactions containing Inception and Interstellar}}{\text{Total number of transactions}} = \frac{3}{5} = 0.6\]Step 3: Calculate Confidence
\[\text{Confidence}(The Matrix \rightarrow Inception) = \frac{\text{Support(The Matrix, Inception)}}{\text{Support(The Matrix)}} = \frac{0.4}{0.6} = 0.67\] \[\text{Confidence}(The Matrix \rightarrow Interstellar) = \frac{\text{Support(The Matrix, Interstellar)}}{\text{Support(The Matrix)}} = \frac{0.6}{0.6} = 1.0\] \[\text{Confidence}(Inception \rightarrow Interstellar) = \frac{\text{Support(Inception, Interstellar)}}{\text{Support(Inception)}} = \frac{0.6}{0.8} = 0.75\]Confidence measures how often items in $Y$ appear in transactions that contain $X$. Minimum confidence measures rule reliability, usually set between 0.5 and 0.9, with 0.7 or 0.8 often used for strong predictive power.
Step 4: Calculate Lift
\[\text{Lift}(The Matrix \rightarrow Inception) = \frac{\text{Confidence}(The Matrix \rightarrow Inception)}{\text{Support(Inception)}} = \frac{0.67}{0.8} = 0.84\] \[\text{Lift}(The Matrix \rightarrow Interstellar)= \frac{\text{Confidence}(The Matrix \rightarrow Interstellar)}{\text{Support(Interstellar)}} = \frac{1.0}{0.8} = 1.25\] \[\text{Lift}(Inception \rightarrow Interstellar) = \frac{\text{Confidence}(Inception \rightarrow Interstellar)}{\text{Support(Interstellar)}} = \frac{0.75}{0.8} = 0.9375\]Lift measures the strength of a rule over the baseline probability of occurrence of $Y$. Lift evaluates the strength of associations, with values above 1 indicating positive correlations, and significantly higher values reflecting stronger associations.
Step 5: Generating Association Rules
Using the confidence and lift values, the association rule $\text{The Matrix} \rightarrow \text{Inception}$ suggests that if a user watches Movie $\text{The Matrix}$, there is a 67% chance they will watch Movie $\text{Inception}$, and the lift indicates this is slightly less than expected by chance (since $Lift < 1$).
Wrap it up!
In summary, the Apriori algorithm is a powerful tool for uncovering associations such as movie recommendations on Netflix. By setting appropriate thresholds for minimum support, minimum confidence, and lift, you can effectively manage the balance between discovering meaningful rules and optimizing computational resources. Understanding and adjusting these parameters based on your dataset’s size and domain-specific requirements is key to deriving valuable insights.Exploring the Apriori algorithm and its parameter settings can lead to actionable patterns and recommendations, enhancing user experiences and business strategies. As you implement and refine your association rule mining approach, consider experimenting with different thresholds and comparing results with other algorithms, such as FP-Growth or Eclat, to find the most effective method for your specific needs. The insights gained from these analyses not only improve recommendation systems but also provide a deeper understanding of user behaviors and preferences.
Apriori from Scratch in Python
The provided Python code implements the Apriori algorithm
to identify frequent itemsets and generate association rules from transactional data, such as movie recommendations. The Apriori class
initializes with minimum support and confidence thresholds and optionally an itemset size limit. Key methods include _find_unique_items_in_all_transactions
, which extracts unique items from transactions; _find_candidates
, which generates candidate itemsets of a specified size; and _calculate_support
, which computes the support for each candidate itemset. The algorithm iterates through itemset sizes, filtering frequent itemsets based on the minimum support. It then generates association rules from these itemsets, calculating their confidence and retaining those that meet the minimum confidence threshold. The fit
method orchestrates the entire process—preparing transactions, finding frequent itemsets, and generating rules—while get_frequent_itemsets
and get_rules
methods retrieve the results. This code highlights the Apriori algorithm’s ability to extract meaningful patterns from data, making it a powerful tool for recommendation systems and market basket analysis.
From Scratch
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
from itertools import combinations
class Apriori:
def __init__(self, min_support, min_confidence, itemset_limit=None):
self.min_support = min_support
self.min_confidence = min_confidence
self.itemset_limit = itemset_limit
self.frequent_itemsets = {} # Changed from set() to dict() to store support values
self.rules = []
def _find_unique_items_in_all_transactions(self):
unique_items = set(item for transaction in self.transactions_list for item in transaction)
return unique_items
def _find_candidates(self, elem_count):
# Generate itemset combinations of size `elem_count`
candidates = set(frozenset(itemset) for itemset in combinations(self.unique_items, elem_count))
return candidates
def _prepare_transactions(self, transactions):
self.transactions = transactions
self.transactions_list = [set(t) for t in transactions] # Convert each transaction to a set
self.unique_items = self._find_unique_items_in_all_transactions()
# Set itemset limit to the number of unique items if not provided
self.itemset_limit = len(self.unique_items) if self.itemset_limit is None else self.itemset_limit
def _calculate_support(self, itemset):
transaction_count = len(self.transactions_list)
subset_count = sum(1 for transaction in self.transactions_list if itemset.issubset(transaction))
support = subset_count / transaction_count
return support
def _filter_frequent_itemsets(self, candidates):
frequent_itemsets = {}
for candidate in candidates:
support = self._calculate_support(candidate)
if support >= self.min_support:
frequent_itemsets[candidate] = support
return frequent_itemsets
def _find_frequent_itemsets(self):
for k in range(1, self.itemset_limit + 1):
print("Checking itemsets of length:", k)
candidates = self._find_candidates(k)
curr_itemsets = self._filter_frequent_itemsets(candidates)
if not curr_itemsets:
print("No frequent itemsets found for size", k)
break # Exit if no frequent itemsets are found
print("Current frequent itemsets:", curr_itemsets)
self.frequent_itemsets.update(curr_itemsets)
def _generate_association_rules(self):
"""Generate all possible association rules from the frequent itemsets."""
for itemset, itemset_support in self.frequent_itemsets.items():
if len(itemset) < 2:
continue # No rules can be generated from single-item sets
# Generate all non-empty subsets of the itemset (possible antecedents)
for antecedent_size in range(1, len(itemset)):
for antecedent in combinations(itemset, antecedent_size):
antecedent = frozenset(antecedent)
consequent = itemset - antecedent
if consequent:
antecedent_support = self.frequent_itemsets.get(antecedent, 0)
if antecedent_support > 0: # To avoid division by zero
confidence = itemset_support / antecedent_support
if confidence >= self.min_confidence:
rule = {
'antecedent': antecedent,
'consequent': consequent,
'confidence': confidence,
'support': itemset_support
}
self.rules.append(rule)
def fit(self, transactions):
self._prepare_transactions(transactions)
self._find_frequent_itemsets()
self._generate_association_rules()
def get_frequent_itemsets(self):
if not self.frequent_itemsets:
print("Frequent itemsets not found. Please fit the model first.")
return self.frequent_itemsets
def get_rules(self):
if not self.rules:
print("No rules generated. Please fit the model and generate association rules.")
return self.rules
if __name__ == "__main__":
apr = Apriori(min_support=0.3, min_confidence=0.7)
transactions_by_name = [
['The Matrix', 'Inception', 'Interstellar'],
['Inception', 'Interstellar'],
['The Matrix', 'Interstellar'],
['The Matrix', 'Inception'],
['Inception', 'Interstellar']
]
apr.fit(transactions_by_genre)
frequent_itemsets = apr.get_frequent_itemsets()
rules = apr.get_rules()
print("Frequent Itemsets:", frequent_itemsets)
print("Association Rules:")
for rule in rules:
print(f"Rule: {rule['antecedent']} -> {rule['consequent']}, Confidence: {rule['confidence']:.2f}, Support: {rule['support']:.2f}")
With mlxtend
Firstly let’s install packages pip install pandas mlxtend
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
# Sample transactions
transactions = [
['The Matrix', 'Inception', 'Interstellar'],
['Inception', 'Interstellar'],
['The Matrix', 'Interstellar'],
['The Matrix', 'Inception'],
['Inception', 'Interstellar']
]
# Convert transactions to DataFrame
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
# Apply Apriori algorithm
min_support = 0.3
frequent_itemsets = apriori(df, min_support=min_support, use_colnames=True)
# Generate association rules
min_confidence = 0.7
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)
# Output results
print("Frequent Itemsets:")
print(frequent_itemsets)
print("\nAssociation Rules:")
for _, row in rules.iterrows():
print(f"Rule: {set(row['antecedents'])} -> {set(row['consequents'])}, Confidence: {row['confidence']:.2f}, Support: {row['support']:.2f}")