Friday, December 14, 2012

Apriori in Java (Part 2)


(Part 1)

Apriori Algorithm:

The algorithm works as follows: first it generates all the frequent itemsets of length 1 w.r.t. the given threshold (minimum support). Then it continue to generate itemsets of lengths 2,3, ... ,n if possible. There are lots of improvements and pruning possible in the implementation.

Do we need to sort frequently? First observation is, if the items are taken in sorted order on step k = 1, all the frequent patterns generated in future will also be in sorted order if maintained properly. So this eliminates the necessity of sorting or storing itemsets in logarithmic data structures like map or set. Rather, we can store them in arraylist or vector like data structure.

How do we generate itemsets of length k+1? We generate itemsets of length k+1 by merging two itemsets of length k. If two itemsets I1 and I2 have a common prefix of length k-1, and I1[k] < I2[k], we can take I1[1 ... k-1] I1[k] I2[k] which is an itemset of length k+1. As our itemsets are sorted, and frequent itemsets generated are also sorted, this can be done in O(N*K^2). Well, if we follow the naive approach, it will take O(N*K^3), however, as we can pre-calculate the length of common prefix of consecutive items in O(N*K), later, we can use this to perform joining operation stated above and also do early termination rather than looking for the entire N items. The approach is demonstrated in source code.

What are the prunings? The most important observation on apriori algorithm is, if an itemset is not frequent, none of its superset can be frequent. Which also tells us, if an itemset has to be frequent, all of its subset must be frequent. Well, the first one is automatically checked in the algorithm. As it takes the frequent itemsets of previous step to calculate itemsets of current step. Now, how the second rule is checked? Well, we don't need to check all the subsets really, all we need is to check the subsets of length k-1 for step k, which can be easily done by performing manual hashing, or using java hashmap. This saves a lots of hassles.

/*
Author: Zobayer Hasan
*/
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class Apriori extends Thread {
    public static boolean debugger = false;
    
    private final Database db;
    private final List< Integer > itemset;
    private final List< List< Integer > > frequent;
    private double minsup;

    public Apriori(String thrName, Database db, double minsup) {
        super(thrName);
        this.db = db;
        itemset = db.getItemset();
        frequent = new ArrayList< List< Integer > >();
        this.minsup = minsup;
    }
    
    @Override
    public void run() {
        double startTime = System.currentTimeMillis();
        
        int k = 1, n = db.dbSize();
        List< List< Integer > > Ck = new ArrayList< List< Integer > >();
        List< List< Integer > > Lk = new ArrayList< List< Integer > >();
        HashMap< List< Integer>, Integer > seenK = new HashMap< List< Integer >, Integer >();
        
        for(Integer item : itemset) {
            List< Integer > temp = new ArrayList< Integer >();
            temp.add(item);
            Lk.add(temp);
        }
        
        while(k <= n && !Lk.isEmpty()) {
            if(debugger) {
                System.out.println("Step " + k);
                System.out.println("Lk: " + Lk);
            }
            
            seenK.clear();
            Ck.clear();
            for(List< Integer > kth : Lk) {
                int count = db.scanDatabase(kth);
                if((double)count < Math.ceil(minsup * (double)n / 100.0)) continue;
                Ck.add(kth);
            }
            
            if(debugger) {
                System.out.println("Ck: " + Ck);
            }
            
            if(Ck.isEmpty()) break;
            
            for(List< Integer > freq : Ck) {
                frequent.add(freq);
                seenK.put(freq, k);
            }
            
            int[] prefixlen = new int[Ck.size()];
            prefixlen[0] = 0;
            for(int i = 1; i < Ck.size(); i++) {
                prefixlen[i] = prefixLen(Ck.get(i-1), Ck.get(i));
            }
            
            List< List< Integer > > temp = new ArrayList< List< Integer > >();
            for(int i = 0; i < Ck.size(); i++) {
                for(int j = i + 1; j < Ck.size(); j++) {
                    if(prefixlen[j] == k-1) {
                        if(debugger) {
                            System.out.println("Joining: " + i + ":" + Ck.get(i) + " + " + j + ":" + Ck.get(j) + " Prefix Length " + prefixlen[j]);
                        }
                        temp.add(prefixJoin(Ck.get(i), Ck.get(j)));
                    }
                    else break;
                }
            }
            
            if(debugger) {
                System.out.println("Temporary: " + temp);
            }

            Lk.clear();
            for(List< Integer > list : temp) {
                boolean candid = true;
                if(k > 1) {
                    for(int i = 0; i < list.size(); i++) {
                        List< Integer > prev = new ArrayList< Integer >();
                        for(int j = 0; j < list.size(); j++) {
                            if(i != j) prev.add(list.get(j));
                        }
                        if(!seenK.containsKey(prev)) {
                            candid = false;
                            break;
                        }
                    }
                }
                if(candid) {
                    Lk.add(list);
                }
            }
            
            if(debugger) {
                System.out.println("Pruned: " + Lk);
            }
            
            k++;
        }
        
        double endTime = System.currentTimeMillis();
        System.out.println("Apriori completed in " + (endTime - startTime)/1000.0 + " seconds");
    }
    
    public void printPatterns() {
        System.out.println("Frequent Itemsets");
        for(List< Integer > pattern : frequent) {
            System.out.println(pattern);
        }
        System.out.println("Total " + frequent.size() + " itemsets");
    }
    
    private int prefixLen(List< Integer > left, List< Integer > right) {
        int len = 0;
        for(len = 0; len < left.size() && len < right.size(); len++) {
            if(left.get(len).compareTo(right.get(len)) != 0) return len;
        }
        return len;
    }
    
    private List< Integer > prefixJoin(List< Integer > left, List< Integer > right) {
        List< Integer > ret = new ArrayList< Integer >();
        for(Integer i : left) {
            ret.add(i);
        }
        ret.add(right.get(right.size() - 1));
        return ret;
    }
}
This class is threaded, so it is possible to take advantages of multicore processors.

A sample test class is shown here.
/*
Author: Zobayer Hasan
*/
public class FIMtest {

    public static void main(String[] args) {
        Database db = null;
        try {
            db = new Database("mushroom.dat");
            
        } catch(Exception e) {
            e.printStackTrace();
        }
        
        System.out.println("\nStarting Apriori");
        
        Apriori test1 = new Apriori("test1", db, 40.0);
        Apriori.debugger = true;
        test1.start();
        try {
            test1.join();
            test1.printPatterns();
        } catch(Exception e) {
            e.printStackTrace();
        }
    }
}

So, this is it, an efficient implementation of Apriori algorithm in java. It is possible that there are some lacking or mistakes in either source code or analysis. If you find any, please let me know. Here is a sample run on mushroom for 40% minimum support.

Have fun...

5 comments:

  1. you have to put mushroom.dat file for peaple to download

    ReplyDelete
    Replies
    1. The link is in part 1 of this post.
      http://fimi.ua.ac.be/data/mushroom.dat

      Delete
  2. This is frequent Itemset generation.The rule generation is still not present and which is the main part of the apriori algorithm

    ReplyDelete
    Replies
    1. Well, from what I remember since its already 5 years, "rule generation is still not present and which is the main part of the apriori algorithm" this is not correct. Frequent itemset mining is composed of Frequent Itemset Generation and Association Rule Generation. We use apriori for the first part, a.k.a Frequent Itemset Generation. Then you use the generated frequent itemset to generate the association rules.

      Anyway, thanks for your comment :)

      Delete
  3. This is frequent Itemset generation method.Apriori association rule generation is missing.Great work though.

    ReplyDelete