Association rules, frequent itemsets, the A-Priori Algorithm

By | December 14, 2015 | 400 Views

Introduction

In data mining, association rules is to extract frequent set of items from data presented as an implication expression of the form X{right}Y , where X {union} Y=varnothing. If all of the items in X appear in some baskets, then Y is ‘likely’ to appear in that baskets as well. The strength of the ‘likely’ can be measured by support and confidence terms.

Support indicates how often a rule is applicable to a given data. It is also used to eliminate uninteresting rules. For example, this is because a rule having very low support is likely to be unprofitable from a business perspective.

Support, s(X {right} Y) = {support(X {union} Y)}/N

Confidence measures the reliability of the inference made by a rule; provides an estimate of the conditional probability of Y given X.

Confidence, c(X {right} Y) = {support(X {union} Y)}/{support(X)}

In market base analysis, suppose that a manager want to know about buying habit of customer from the retail data on a store, which enables him to plan marketing or advertising strategies, or in the design of a new brands more effectively. For instance, the information that customers who purchase smartphones also tend to buy phone case at the same time. This pattern is represented in the following association rule:

smartphone ==> phone_case [ support = 5%, confidence = 70%]

Both support and confidence respectively reflect the usefulness and certainly of discovery rule. A support of 5% for the rule above means that 5% of all the transactions under analysis reveal that smartphone and phone case are purchased together. A confidence of 70% means that 70% of the customers who purchased a smartphone also bought the phone case.

Association rules analysis can be also used to mine many other kinds of data in different domains such as Web mining, bioinformatics, medical diagnosis, and scientific data analysis. In medical diagnosis, for example, a set of itemsets about a patient: their blood-chemistry, and medical history of disease is considered as a basket. Finding association rules means that one disease of being involved with biomarkers such as genes or blood proteins, and diseases.

Association Rule Discovery

To begin with a simplest question, find all the itemsets that appear together “frequently” in basket; only do this if the rules have support minsup (minimum support thresholds) and confidence minconf (minimum confidence thresholds). Thus, sets of items that appear in at least minsup are called frequent itemsets

There are two-steps approach to mine association rules.

  1. Find all frequent itemset whose support minsup
  2. Generate strong association rules from the frequent itemsets that must satisfy minsup and minconf

For many frequent-itemsets algorithms, main-memory bottleneck is critically concerned because we read baskets, and then count a huge number of different itemsets as well as swap counts in/out. Therefore, frequent itemset generation is still computationally expensive.

A first candidate to resolve these problems is A-Priori algorithm (R.Agrawal and R.Srikant, 1994). A basic idea of A-Priori is all nonempty subsets of a frequent itemset must also be frequent. Basing on the idea, a two-step process is employed as generate and prune actions:

  1. Candidate generation: This step is to generate a new set of candidate k-itemsets by join k-1 itemset itself found in the previous iteration.
  2. Candidate Pruning: This step eliminates infrequent candidate k-itemsets depending on the support-based pruning strategy.

A-Priori algorithm: Find frequent itemsets using an iterative level-wise approach based on candidate generation

  1. A transactions in database D are scanned to determine frequent 1-itemsets, {L_1} by comparing with minsup
  2. Generate candidate k-itemset {C_k} from the two joining k-1 itemsets, {L_{k-1}} and remove its infrequent subset.
  3. Scan D to get support count for each k-itemsets, {C_k}.
  4. The set of frequent k-itemsets,{L_k} is then determined. {L_k}  results from support count of  candidate k-1 itemset minsup
  5. Back to step 2 until there is no candidate k+1 itemsets, {C_{k+1}}
  6. Extract the frequent k-itemsets, L = {L_k}

Generate association rules from the frequent itemsets

It must be assumed that we found the frequent itemsets from transaction in a database D that meet a threshold of support, and extract support calculated for each of these itemsets. Thus, we can find all association rules as follows:

  • For each frequent itemset L, generate all nonempty subsets of L.
  • For each nonempty subset s of L, output the rule math_968_a4ff9115e35c5553867b9d1338ed67fa

Transaction data for E-commerce site:

TID-A-Prori

A-Prior demo

Figure 1: Generation of the candidate itemsets and frequent itemsets, where the minimum  support count is 2 (*).

Implementation

A simple code version implemented by Java to illustrate how the A-Prior algorithm works as following description:

public Map<Set<T>, Integer> generateFrequentItemSets(List<Set<T>> transactionList, int minSupport) {

		Map<Set<T>, Integer> supportCountMap = new HashMap<Set<T>, Integer>();
		// Find all frequent 1-item sets
		Map<Set<T>, Integer> frequent1ItemMap = findFrequent1Itemsets(transactionList, minSupport);
		List<Set<T>> frequentItemList = new ArrayList<Set<T>>(frequent1ItemMap.keySet());

		Map<Integer, List<Set<T>>> map = new HashMap<Integer, List<Set<T>>>();
		map.put(1, frequentItemList);

		int k = 1;
		for (k = 2; !map.get(k - 1).isEmpty(); k++) {

			// First generate the candidates.
			List<Set<T>> candidateList = aprioriGenerate(map.get(k - 1));
			// Scan D for counts
			for (Set<T> transaction : transactionList) {
				// Get the subsets of t that are candidates
				List<Set<T>> candidateList2 = subSets(candidateList, transaction);

				for (Set<T> itemset : candidateList2) {
					// Increase support count
					int count = supportCountMap.get(itemset) == null ? 1 : supportCountMap.get(itemset) + 1;
					supportCountMap.put(itemset, count);
				}
			}
			// Generate next possible frequent candidates
			map.put(k, extractNextFrequentCandidates(candidateList, supportCountMap, minSupport));
		}
		return getFrequentItemsets(map, supportCountMap, frequent1ItemMap);
	}


The aprioriGenerate function generate candidate itemsets

public List<Set<T>> aprioriGenerate(List<Set<T>> frequentItemSets) {

		List<Set<T>> candidatesGen = new ArrayList<Set<T>>();
		// Make sure that items within a transaction or itemset are sorted in
		// lexicographic order
		List<List<T>> sortedList = sortList(frequentItemSets);
		// Generate itemSet from L(k-1)
		for (int i = 0; i < sortedList.size(); ++i) {
			for (int j = i + 1; j < sortedList.size(); ++j) {
				// Check condition L(k-1) joining with itself
				if (isJoinable(sortedList.get(i), sortedList.get(j))) {
					// join step: generate candidates
					Set<T> candidate = tryJoinItemSets(sortedList.get(i), sortedList.get(j));
					if (hasFrequentSubSet(candidate, frequentItemSets)) {
						// Add this candidate to C(k)
						candidatesGen.add(candidate);
					}
				}
			}
		}
		return candidatesGen;
	}


The apGenrules function generate association rules basing the frequent itemsets

public void apGenrules(Set<T> fk, Set<T> hm, Map<Set<T>, Integer> frequentItemCounts, Map<Set<T>, Set<T>> rules,
			double minConf) {
		Combination<T> c = new Combination<T>();
		List<T> list = new ArrayList<T>(fk);

		int k = fk.size() - 1;
		if (k > 0) {
			// Generate all nonempty subsets of fk
			Set<List<T>> subsets = c.combination(list, k);
			for (List<T> subset : subsets) {
				// For every nonempty subset s of fk, output the rule s->fk-s
				Set<T> s = new HashSet<T>(subset);
				Set<T> ls = new HashSet<T>(hm);
				ls.removeAll(subset);
				// Avoid duplicated generate rule
				if (!rules.containsKey(s)) {
					double conf = frequentItemCounts.get(fk) / (double) frequentItemCounts.get(s);
					// Check support of the minimum confidence threshold
					if (conf > minConf) {
						System.out.println(subset + "->" + ls + " confidence = " + conf);
					}
					// Keep tracking the existing rules generated
					rules.put(s, ls);
					subsets.removeAll(s);
					// Call apGenrules recursive
					apGenrules(s, hm, frequentItemCounts, rules, minConf);
				}
			}
		}
	}


With minConf = 0.7 and minSup =2. See the following output in the console :

[I1, I2] 4
[I1, I3] 4
[I2, I3] 4
[I1, I5] 2
[I2, I4] 2
[I2, I5] 2
[I1, I2, I3] 2
[I1, I2, I5] 2
[I5]->[I1] confidence = 1.0
[I4]->[I2] confidence = 1.0
[I5]->[I2] confidence = 1.0
[I1, I5]->[I2] confidence = 1.0
[I5]->[I1, I2] confidence = 1.0
[I2, I5]->[I1] confidence = 1.0

Code found here: APrioriServiceImpl


References:

  1. Rakesh Agrawal and Ramakrishnan Srikant, Fast Algorithms for Mining Association Rules, IBM Almaden Research Center, 1994.
  2. P. N. Tan, M. Steinbach, and V. Kumar, “Association Analysis: Basic Concepts and Algorithms”, In Introduction to Data Mining, pp.327-396. Addison-Wesley,March 25, 2006.
  3. J. Leskovec, A. Rajaraman, and J. D. Ullman, “Frequent Itemsets”, In Mining of Massive Datasets, , pp.201-238. Cambridge University Press, 2012.
  4. J. Han, M. Kamber, and J. Pei, “Mining Frequent Pattern, Assocations, and Correlations:Basic Concepts and Methods”, In Data Mining – Concepts and Techniques, pp. 243 – 273. Morgan Kaufmann Publisher, Third Edition, 2012.

5 thoughts on “Association rules, frequent itemsets, the A-Priori Algorithm

  1. Williamhew

    I really like and appreciate your article post.Really looking forward to read more. Fantastic.

    Reply
  2. Atlantis

    I tries to run your code. It shows error with the Combination.Why is that ? Can you please provide the source.txt file? It will be a great help…thank you

    Reply
  3. Atlantis

    Thank you so much for the help. It works fine now..

    Reply
    1. khiem Post author

      Thank for your kind of words.
      I think you should use a FP-growth (finding frequent itemsets without candidate generation). This is because the Apriori approach still needs to generate a huge number of candidate sets. For instance, if there are 10^4 frequent 1-itemset, the Apriori algorithm will generate more than 10^7 candiate 2-itemset. Moreover, it extremely costs once scanning each transaction in the database to determine the support of the candidate itemsets.

      Reply

Leave a Reply

Your email address will not be published. Required fields are marked *