(Part 1)
Apriori Algorithm:
The algorithm works as follows: first it generates all the frequent itemsets of length 1 w.r.t. the given threshold (minimum support). Then it continue to generate itemsets of lengths 2,3, ... ,n if possible. There are lots of improvements and pruning possible in the implementation.
Do we need to sort frequently? First observation is, if the items are taken in sorted order on step k = 1, all the frequent patterns generated in future will also be in sorted order if maintained properly. So this eliminates the necessity of sorting or storing itemsets in logarithmic data structures like map or set. Rather, we can store them in arraylist or vector like data structure.
How do we generate itemsets of length k+1? We generate itemsets of length k+1 by merging two itemsets of length k. If two itemsets I1 and I2 have a common prefix of length k-1, and I1[k] < I2[k], we can take I1[1 ... k-1] I1[k] I2[k] which is an itemset of length k+1. As our itemsets are sorted, and frequent itemsets generated are also sorted, this can be done in O(N*K^2). Well, if we follow the naive approach, it will take O(N*K^3), however, as we can pre-calculate the length of common prefix of consecutive items in O(N*K), later, we can use this to perform joining operation stated above and also do early termination rather than looking for the entire N items. The approach is demonstrated in source code.
What are the prunings? The most important observation on apriori algorithm is, if an itemset is not frequent, none of its superset can be frequent. Which also tells us, if an itemset has to be frequent, all of its subset must be frequent. Well, the first one is automatically checked in the algorithm. As it takes the frequent itemsets of previous step to calculate itemsets of current step. Now, how the second rule is checked? Well, we don't need to check all the subsets really, all we need is to check the subsets of length k-1 for step k, which can be easily done by performing manual hashing, or using java hashmap. This saves a lots of hassles.
/*
Author: Zobayer Hasan
*/
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
public class Apriori extends Thread {
public static boolean debugger = false;
private final Database db;
private final List< Integer > itemset;
private final List< List< Integer > > frequent;
private double minsup;
public Apriori(String thrName, Database db, double minsup) {
super(thrName);
this.db = db;
itemset = db.getItemset();
frequent = new ArrayList< List< Integer > >();
this.minsup = minsup;
}
@Override
public void run() {
double startTime = System.currentTimeMillis();
int k = 1, n = db.dbSize();
List< List< Integer > > Ck = new ArrayList< List< Integer > >();
List< List< Integer > > Lk = new ArrayList< List< Integer > >();
HashMap< List< Integer>, Integer > seenK = new HashMap< List< Integer >, Integer >();
for(Integer item : itemset) {
List< Integer > temp = new ArrayList< Integer >();
temp.add(item);
Lk.add(temp);
}
while(k <= n && !Lk.isEmpty()) {
if(debugger) {
System.out.println("Step " + k);
System.out.println("Lk: " + Lk);
}
seenK.clear();
Ck.clear();
for(List< Integer > kth : Lk) {
int count = db.scanDatabase(kth);
if((double)count < Math.ceil(minsup * (double)n / 100.0)) continue;
Ck.add(kth);
}
if(debugger) {
System.out.println("Ck: " + Ck);
}
if(Ck.isEmpty()) break;
for(List< Integer > freq : Ck) {
frequent.add(freq);
seenK.put(freq, k);
}
int[] prefixlen = new int[Ck.size()];
prefixlen[0] = 0;
for(int i = 1; i < Ck.size(); i++) {
prefixlen[i] = prefixLen(Ck.get(i-1), Ck.get(i));
}
List< List< Integer > > temp = new ArrayList< List< Integer > >();
for(int i = 0; i < Ck.size(); i++) {
for(int j = i + 1; j < Ck.size(); j++) {
if(prefixlen[j] == k-1) {
if(debugger) {
System.out.println("Joining: " + i + ":" + Ck.get(i) + " + " + j + ":" + Ck.get(j) + " Prefix Length " + prefixlen[j]);
}
temp.add(prefixJoin(Ck.get(i), Ck.get(j)));
}
else break;
}
}
if(debugger) {
System.out.println("Temporary: " + temp);
}
Lk.clear();
for(List< Integer > list : temp) {
boolean candid = true;
if(k > 1) {
for(int i = 0; i < list.size(); i++) {
List< Integer > prev = new ArrayList< Integer >();
for(int j = 0; j < list.size(); j++) {
if(i != j) prev.add(list.get(j));
}
if(!seenK.containsKey(prev)) {
candid = false;
break;
}
}
}
if(candid) {
Lk.add(list);
}
}
if(debugger) {
System.out.println("Pruned: " + Lk);
}
k++;
}
double endTime = System.currentTimeMillis();
System.out.println("Apriori completed in " + (endTime - startTime)/1000.0 + " seconds");
}
public void printPatterns() {
System.out.println("Frequent Itemsets");
for(List< Integer > pattern : frequent) {
System.out.println(pattern);
}
System.out.println("Total " + frequent.size() + " itemsets");
}
private int prefixLen(List< Integer > left, List< Integer > right) {
int len = 0;
for(len = 0; len < left.size() && len < right.size(); len++) {
if(left.get(len).compareTo(right.get(len)) != 0) return len;
}
return len;
}
private List< Integer > prefixJoin(List< Integer > left, List< Integer > right) {
List< Integer > ret = new ArrayList< Integer >();
for(Integer i : left) {
ret.add(i);
}
ret.add(right.get(right.size() - 1));
return ret;
}
}
This class is threaded, so it is possible to take advantages of multicore processors.
A sample test class is shown here.
/*
Author: Zobayer Hasan
*/
public class FIMtest {
public static void main(String[] args) {
Database db = null;
try {
db = new Database("mushroom.dat");
} catch(Exception e) {
e.printStackTrace();
}
System.out.println("\nStarting Apriori");
Apriori test1 = new Apriori("test1", db, 40.0);
Apriori.debugger = true;
test1.start();
try {
test1.join();
test1.printPatterns();
} catch(Exception e) {
e.printStackTrace();
}
}
}
So, this is it, an efficient implementation of Apriori algorithm in java. It is possible that there are some lacking or mistakes in either source code or analysis. If you find any, please let me know.
Here is a sample run on mushroom for 40% minimum support.
Have fun...