**Introduction**

In data mining, **association rule** mining extracts frequent sets of items from data, expressed as implication rules of the form X ⇒ Y, where X and Y are disjoint itemsets (X ∩ Y = ∅). If all of the items in X appear in some basket, then Y is ‘likely’ to appear in that basket as well. The strength of this ‘likely’ is measured in terms of support and confidence.

**Support** indicates how often a rule is applicable to a given data set. It is also used to eliminate uninteresting rules: a rule with very low support applies to few transactions and is likely to be unprofitable from a business perspective.

Support, *s*(X ⇒ Y) = σ(X ∪ Y) / N, where σ(X ∪ Y) is the number of transactions that contain every item in X ∪ Y and N is the total number of transactions.

**Confidence** measures the reliability of the inference made by a rule; it provides an estimate of the conditional probability of Y given X.

Confidence, *c*(X ⇒ Y) = σ(X ∪ Y) / σ(X).

In market basket analysis, suppose a manager wants to learn about the buying habits of customers from the retail data of a store, so that he can plan marketing and advertising strategies, or design new product bundles, more effectively. For instance, he may discover that customers who purchase smartphones also tend to buy phone cases at the same time. This pattern is represented by the following association rule:

*smartphone ==> phone_case [ support = 5%, confidence = 70%]*

Support and confidence respectively reflect the usefulness and certainty of a discovered rule. A support of 5% for the rule above means that smartphone and phone case are purchased together in 5% of all the transactions under analysis. A confidence of 70% means that 70% of the customers who purchased a smartphone also bought a phone case.
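As a quick illustration, the two measures can be computed directly from a list of transactions. The baskets below are made up for this example; only the smartphone/phone-case rule comes from the text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RuleMetrics {

    // sigma(itemset): the number of transactions containing every item of the itemset
    static int supportCount(List<Set<String>> transactions, Set<String> itemset) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(itemset)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical baskets for the smartphone -> phone_case rule
        List<Set<String>> transactions = Arrays.asList(
                new HashSet<>(Arrays.asList("smartphone", "phone_case")),
                new HashSet<>(Arrays.asList("smartphone", "charger")),
                new HashSet<>(Arrays.asList("phone_case")),
                new HashSet<>(Arrays.asList("smartphone", "phone_case", "charger")));

        Set<String> x = new HashSet<>(Arrays.asList("smartphone"));
        Set<String> xy = new HashSet<>(Arrays.asList("smartphone", "phone_case"));

        // support(X -> Y) = sigma(X u Y) / N
        double support = (double) supportCount(transactions, xy) / transactions.size();
        // confidence(X -> Y) = sigma(X u Y) / sigma(X)
        double confidence = (double) supportCount(transactions, xy) / supportCount(transactions, x);

        System.out.println("support = " + support);       // 2 of 4 baskets -> 0.5
        System.out.println("confidence = " + confidence); // 2 of the 3 smartphone baskets
    }
}
```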

Association rule analysis can also be used to mine many other kinds of data in domains such as Web mining, bioinformatics, medical diagnosis, and scientific data analysis. In medical diagnosis, for example, the data about a patient (blood chemistry and medical history of diseases) can be treated as a basket, and discovered rules can reveal associations between *biomarkers*, such as genes or blood proteins, and diseases.

**Association Rule Discovery**

To begin with the simplest question: find all the itemsets that appear together “frequently” in baskets. We only keep rules whose support is at least *minsup* (the minimum support threshold) and whose confidence is at least *minconf* (the minimum confidence threshold). Sets of items whose support is at least *minsup* are called **frequent itemsets**.

There is a two-step approach to mining association rules:

- Find all frequent itemsets, that is, itemsets whose support ≥ *minsup*.
- Generate strong association rules from the frequent itemsets; these rules must satisfy both *minsup* and *minconf*.

For many frequent-itemset algorithms, the main-memory bottleneck is a critical concern: we read the baskets, count a huge number of different itemsets, and may have to swap counts in and out of memory. Frequent itemset generation is therefore computationally expensive.

A first candidate to resolve these problems is the *A-Priori algorithm* (R. Agrawal and R. Srikant, 1994). Its basic idea is that all nonempty subsets of a frequent itemset must also be frequent. Based on this idea, a two-step process of generate and prune actions is employed:

- **Candidate generation**: generate a new set of candidate k-itemsets by joining the frequent (k−1)-itemsets found in the previous iteration with themselves.
- **Candidate pruning**: eliminate candidate k-itemsets that contain an infrequent (k−1)-subset, following the support-based pruning strategy.

**A-Priori algorithm**: find frequent itemsets using an iterative, level-wise approach based on candidate generation.

1. Scan the transactions in database D to determine the set of frequent 1-itemsets, L1, by comparing each item's support count with *minsup*.
2. Generate the candidate k-itemsets, Ck, by joining pairs of frequent (k−1)-itemsets in L(k−1), and remove any candidate that has an infrequent (k−1)-subset.
3. Scan D to get the support count of each candidate k-itemset in Ck.
4. Determine the set of frequent k-itemsets, Lk: the candidates in Ck whose support count is at least *minsup*.
5. Go back to step 2 until no candidate (k+1)-itemsets can be generated.
6. Return the frequent itemsets, L = L1 ∪ L2 ∪ … ∪ Lk.

**Generate association rules from the frequent itemsets**

Assume that we have found the frequent itemsets from the transactions in database D that meet the support threshold, together with the support count of each of these itemsets. We can then generate all association rules as follows:

- For each frequent itemset L, generate all nonempty proper subsets of L.
- For each nonempty proper subset s of L, output the rule s ⇒ (L − s) if σ(L) / σ(s) ≥ *minconf*.
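The two steps above can be sketched compactly for a single frequent itemset. This is a minimal illustration, separate from the full implementation later in the post: the pair and triple support counts match the worked example, while the 1-itemset counts (6, 7, 2) are assumed for illustration.

```java
import java.util.*;

public class RuleGenSketch {

    // For a frequent itemset L, list the rules s -> (L - s) whose confidence
    // sigma(L) / sigma(s) reaches minConf
    static List<String> strongRules(Set<String> L, Map<Set<String>, Integer> counts, double minConf) {
        List<String> rules = new ArrayList<>();
        List<String> items = new ArrayList<>(L);
        int n = items.size();
        // Enumerate all nonempty proper subsets of L via bitmasks
        for (int mask = 1; mask < (1 << n) - 1; mask++) {
            Set<String> s = new HashSet<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    s.add(items.get(i));
                }
            }
            Set<String> rest = new HashSet<>(L);
            rest.removeAll(s);
            double conf = counts.get(L) / (double) counts.get(s);
            if (conf >= minConf) {
                rules.add(s + " -> " + rest);
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        // Support counts; the 1-itemset counts are assumptions for this sketch
        Map<Set<String>, Integer> counts = new HashMap<>();
        counts.put(new HashSet<>(Arrays.asList("I1")), 6);
        counts.put(new HashSet<>(Arrays.asList("I2")), 7);
        counts.put(new HashSet<>(Arrays.asList("I5")), 2);
        counts.put(new HashSet<>(Arrays.asList("I1", "I2")), 4);
        counts.put(new HashSet<>(Arrays.asList("I1", "I5")), 2);
        counts.put(new HashSet<>(Arrays.asList("I2", "I5")), 2);
        Set<String> L = new HashSet<>(Arrays.asList("I1", "I2", "I5"));
        counts.put(L, 2);

        // Prints the three rules with confidence 1.0; e.g. {I1, I2} -> {I5}
        // fails because 2 / 4 = 0.5 < 0.7
        for (String rule : strongRules(L, counts, 0.7)) {
            System.out.println(rule);
        }
    }
}
```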

Transaction data for E-commerce site:

**Figure 1**: Generation of the candidate itemsets and frequent itemsets, where the minimum support count is 2.

**Implementation**

A simple version implemented in Java illustrates how the A-Priori algorithm works:

```java
public Map<Set<T>, Integer> generateFrequentItemSets(List<Set<T>> transactionList, int minSupport) {
    Map<Set<T>, Integer> supportCountMap = new HashMap<Set<T>, Integer>();
    // Find all frequent 1-itemsets
    Map<Set<T>, Integer> frequent1ItemMap = findFrequent1Itemsets(transactionList, minSupport);
    List<Set<T>> frequentItemList = new ArrayList<Set<T>>(frequent1ItemMap.keySet());
    Map<Integer, List<Set<T>>> map = new HashMap<Integer, List<Set<T>>>();
    map.put(1, frequentItemList);
    for (int k = 2; !map.get(k - 1).isEmpty(); k++) {
        // First generate the candidates
        List<Set<T>> candidateList = aprioriGenerate(map.get(k - 1));
        // Scan D for counts
        for (Set<T> transaction : transactionList) {
            // Get the subsets of the transaction that are candidates
            List<Set<T>> candidateList2 = subSets(candidateList, transaction);
            for (Set<T> itemset : candidateList2) {
                // Increase the support count
                int count = supportCountMap.get(itemset) == null ? 1 : supportCountMap.get(itemset) + 1;
                supportCountMap.put(itemset, count);
            }
        }
        // Keep the candidates that reach the minimum support as the next level
        map.put(k, extractNextFrequentCandidates(candidateList, supportCountMap, minSupport));
    }
    return getFrequentItemsets(map, supportCountMap, frequent1ItemMap);
}
```

The *aprioriGenerate* function generates the candidate itemsets:

```java
public List<Set<T>> aprioriGenerate(List<Set<T>> frequentItemSets) {
    List<Set<T>> candidatesGen = new ArrayList<Set<T>>();
    // Make sure that items within an itemset are sorted in lexicographic order
    List<List<T>> sortedList = sortList(frequentItemSets);
    // Generate candidate itemsets from L(k-1)
    for (int i = 0; i < sortedList.size(); ++i) {
        for (int j = i + 1; j < sortedList.size(); ++j) {
            // Check whether the two (k-1)-itemsets of L(k-1) can be joined
            if (isJoinable(sortedList.get(i), sortedList.get(j))) {
                // Join step: generate a candidate
                Set<T> candidate = tryJoinItemSets(sortedList.get(i), sortedList.get(j));
                // Prune step: keep the candidate only if all its (k-1)-subsets are frequent
                if (hasFrequentSubSet(candidate, frequentItemSets)) {
                    // Add this candidate to C(k)
                    candidatesGen.add(candidate);
                }
            }
        }
    }
    return candidatesGen;
}
```

The *apGenrules* function generates association rules based on the frequent itemsets:

```java
public void apGenrules(Set<T> fk, Set<T> hm, Map<Set<T>, Integer> frequentItemCounts,
        Map<Set<T>, Set<T>> rules, double minConf) {
    Combination<T> c = new Combination<T>();
    List<T> list = new ArrayList<T>(fk);
    int k = fk.size() - 1;
    if (k > 0) {
        // Generate all nonempty subsets of fk of size k
        Set<List<T>> subsets = c.combination(list, k);
        for (List<T> subset : subsets) {
            // For every nonempty subset s of fk, output the rule s -> (fk - s)
            Set<T> s = new HashSet<T>(subset);
            Set<T> ls = new HashSet<T>(hm);
            ls.removeAll(subset);
            // Avoid generating duplicated rules
            if (!rules.containsKey(s)) {
                double conf = frequentItemCounts.get(fk) / (double) frequentItemCounts.get(s);
                // Check against the minimum confidence threshold
                if (conf > minConf) {
                    System.out.println(subset + "->" + ls + " confidence = " + conf);
                }
                // Keep track of the rules generated so far
                rules.put(s, ls);
                // Call apGenrules recursively
                apGenrules(s, hm, frequentItemCounts, rules, minConf);
            }
        }
    }
}
```

With *minConf* = 0.7 and *minSup* = 2, the following output appears in the console:

```
[I1, I2] 4
[I1, I3] 4
[I2, I3] 4
[I1, I5] 2
[I2, I4] 2
[I2, I5] 2
[I1, I2, I3] 2
[I1, I2, I5] 2
[I5]->[I1] confidence = 1.0
[I4]->[I2] confidence = 1.0
[I5]->[I2] confidence = 1.0
[I1, I5]->[I2] confidence = 1.0
[I5]->[I1, I2] confidence = 1.0
[I2, I5]->[I1] confidence = 1.0
```

Code found here: APrioriServiceImpl

**References**:

- Rakesh Agrawal and Ramakrishnan Srikant, *Fast Algorithms for Mining Association Rules*, IBM Almaden Research Center, 1994.
- P. N. Tan, M. Steinbach, and V. Kumar, “Association Analysis: Basic Concepts and Algorithms”, in *Introduction to Data Mining*, pp. 327–396. Addison-Wesley, 2006.
- J. Leskovec, A. Rajaraman, and J. D. Ullman, “Frequent Itemsets”, in *Mining of Massive Datasets*, pp. 201–238. Cambridge University Press, 2012.
- J. Han, M. Kamber, and J. Pei, “Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods”, in *Data Mining: Concepts and Techniques*, Third Edition, pp. 243–273. Morgan Kaufmann, 2012.


I tried to run your code. It shows an error with the Combination class. Why is that? Can you please provide the source.txt file? It would be a great help. Thank you.

Sorry, I missed the Combination class. You can refer to Combination.txt; otherwise, you can write a new one. Please see more details in Combinations in lexicographic order.

An updated version with improvements to the design patterns and performance will come soon.

Thank you so much for the help. It works fine now.

Thanks for your kind words.

I think you should use FP-growth (finding frequent itemsets without candidate generation), because the Apriori approach still needs to generate a huge number of candidate sets. For instance, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will generate more than 10^7 candidate 2-itemsets. Moreover, repeatedly scanning every transaction in the database to determine the support of the candidate itemsets is extremely costly.