
Friday, December 14, 2012

Apriori in Java (Part 2)


(Part 1)

Apriori Algorithm:

The algorithm works as follows: first it generates all the frequent itemsets of length 1 w.r.t. the given threshold (minimum support). Then it continues to generate itemsets of lengths 2, 3, ..., n for as long as any remain frequent. Many improvements and prunings are possible in the implementation.

Do we need to sort frequently? The first observation is that if the items are taken in sorted order at step k = 1, all frequent patterns generated later will also be in sorted order if maintained properly. This eliminates the need for repeated sorting, or for storing itemsets in logarithmic data structures like maps or sets; instead, we can keep them in an ArrayList- or vector-like structure.

How do we generate itemsets of length k+1? We generate them by merging two itemsets of length k. If two itemsets I1 and I2 share a common prefix of length k-1, and I1[k] < I2[k], we can take I1[1 ... k-1] I1[k] I2[k], which is an itemset of length k+1. As our itemsets are sorted, and the frequent itemsets generated are also sorted, this can be done in O(N*K^2). A naive approach would take O(N*K^3); however, we can pre-calculate the length of the common prefix of consecutive itemsets in O(N*K), and later use this both to perform the joining operation stated above and to terminate early instead of scanning all N itemsets. The approach is demonstrated in the source code.
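The joining rule can be sketched in isolation. This toy class (the class and method names are mine, not part of the implementation below) merges two sorted k-itemsets sharing a (k-1)-prefix into a (k+1)-itemset:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JoinDemo {
    // Merge two sorted k-itemsets that share a (k-1)-prefix into a (k+1)-itemset:
    // copy the first itemset, then append the last item of the second.
    static List<Integer> join(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<Integer>(a);   // I1[1 ... k-1] I1[k]
        out.add(b.get(b.size() - 1));                    // append I2[k]
        return out;
    }

    public static void main(String[] args) {
        // {1,2,3} and {1,2,4} share the prefix {1,2}, so they join into {1,2,3,4}
        System.out.println(join(Arrays.asList(1, 2, 3), Arrays.asList(1, 2, 4)));  // [1, 2, 3, 4]
    }
}
```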

What are the prunings? The most important observation about the Apriori algorithm is that if an itemset is not frequent, none of its supersets can be frequent. Equivalently, for an itemset to be frequent, all of its subsets must be frequent. The first property is checked automatically, since the algorithm builds the itemsets of the current step from the frequent itemsets of the previous step. How is the second rule checked? We don't actually need to check all subsets; it suffices to check the subsets of length k-1 at step k, which can easily be done with manual hashing or a Java HashMap. This saves a lot of hassle.
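A minimal sketch of this subset check (class and method names are mine; I use a HashSet here, while the class below uses a HashMap for the same purpose):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PruneDemo {
    // A (k+1)-candidate survives only if every one of its k-subsets,
    // obtained by dropping one item at a time, is a known frequent itemset.
    static boolean allSubsetsFrequent(List<Integer> cand, Set<List<Integer>> frequentK) {
        for (int skip = 0; skip < cand.size(); skip++) {
            List<Integer> sub = new ArrayList<Integer>();
            for (int j = 0; j < cand.size(); j++) {
                if (j != skip) sub.add(cand.get(j));   // drop one item
            }
            if (!frequentK.contains(sub)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Set<List<Integer>> l2 = new HashSet<List<Integer>>(Arrays.asList(
            Arrays.asList(1, 2), Arrays.asList(1, 3), Arrays.asList(2, 3)));
        // {1,2,3} is kept: {1,2}, {1,3} and {2,3} are all frequent
        System.out.println(allSubsetsFrequent(Arrays.asList(1, 2, 3), l2));  // true
        // {1,2,4} is pruned: {1,4} and {2,4} are missing
        System.out.println(allSubsetsFrequent(Arrays.asList(1, 2, 4), l2));  // false
    }
}
```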

/*
Author: Zobayer Hasan
*/
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class Apriori extends Thread {
    public static boolean debugger = false;
    
    private final Database db;
    private final List< Integer > itemset;
    private final List< List< Integer > > frequent;
    private double minsup;

    public Apriori(String thrName, Database db, double minsup) {
        super(thrName);
        this.db = db;
        itemset = db.getItemset();
        frequent = new ArrayList< List< Integer > >();
        this.minsup = minsup;
    }
    
    @Override
    public void run() {
        double startTime = System.currentTimeMillis();
        
        int k = 1, n = db.dbSize();
        List< List< Integer > > Ck = new ArrayList< List< Integer > >(); // frequent itemsets of this step
        List< List< Integer > > Lk = new ArrayList< List< Integer > >(); // candidate itemsets of this step
        HashMap< List< Integer>, Integer > seenK = new HashMap< List< Integer >, Integer >(); // frequent k-itemsets, for subset pruning
        
        for(Integer item : itemset) {
            List< Integer > temp = new ArrayList< Integer >();
            temp.add(item);
            Lk.add(temp);
        }
        
        while(k <= n && !Lk.isEmpty()) {
            if(debugger) {
                System.out.println("Step " + k);
                System.out.println("Lk: " + Lk);
            }
            
            seenK.clear();
            Ck.clear();
            // count the support of each candidate and drop those below the minsup percentage
            for(List< Integer > kth : Lk) {
                int count = db.scanDatabase(kth);
                if((double)count < Math.ceil(minsup * (double)n / 100.0)) continue;
                Ck.add(kth);
            }
            
            if(debugger) {
                System.out.println("Ck: " + Ck);
            }
            
            if(Ck.isEmpty()) break;
            
            for(List< Integer > freq : Ck) {
                frequent.add(freq);
                seenK.put(freq, k);
            }
            
            // prefixlen[i] = common prefix length of consecutive itemsets, pre-computed in O(N*K)
            int[] prefixlen = new int[Ck.size()];
            prefixlen[0] = 0;
            for(int i = 1; i < Ck.size(); i++) {
                prefixlen[i] = prefixLen(Ck.get(i-1), Ck.get(i));
            }
            
            List< List< Integer > > temp = new ArrayList< List< Integer > >();
            // join pairs sharing a (k-1)-prefix; consecutive prefix lengths allow early termination
            for(int i = 0; i < Ck.size(); i++) {
                for(int j = i + 1; j < Ck.size(); j++) {
                    if(prefixlen[j] == k-1) {
                        if(debugger) {
                            System.out.println("Joining: " + i + ":" + Ck.get(i) + " + " + j + ":" + Ck.get(j) + " Prefix Length " + prefixlen[j]);
                        }
                        temp.add(prefixJoin(Ck.get(i), Ck.get(j)));
                    }
                    else break;
                }
            }
            
            if(debugger) {
                System.out.println("Temporary: " + temp);
            }

            Lk.clear();
            // keep a candidate only if every one of its (k-1)-subsets was frequent
            for(List< Integer > list : temp) {
                boolean candid = true;
                if(k > 1) {
                    for(int i = 0; i < list.size(); i++) {
                        List< Integer > prev = new ArrayList< Integer >();
                        for(int j = 0; j < list.size(); j++) {
                            if(i != j) prev.add(list.get(j));
                        }
                        if(!seenK.containsKey(prev)) {
                            candid = false;
                            break;
                        }
                    }
                }
                if(candid) {
                    Lk.add(list);
                }
            }
            
            if(debugger) {
                System.out.println("Pruned: " + Lk);
            }
            
            k++;
        }
        
        double endTime = System.currentTimeMillis();
        System.out.println("Apriori completed in " + (endTime - startTime)/1000.0 + " seconds");
    }
    
    public void printPatterns() {
        System.out.println("Frequent Itemsets");
        for(List< Integer > pattern : frequent) {
            System.out.println(pattern);
        }
        System.out.println("Total " + frequent.size() + " itemsets");
    }
    
    private int prefixLen(List< Integer > left, List< Integer > right) {
        int len = 0;
        for(len = 0; len < left.size() && len < right.size(); len++) {
            if(left.get(len).compareTo(right.get(len)) != 0) return len;
        }
        return len;
    }
    
    private List< Integer > prefixJoin(List< Integer > left, List< Integer > right) {
        List< Integer > ret = new ArrayList< Integer >();
        for(Integer i : left) {
            ret.add(i);
        }
        ret.add(right.get(right.size() - 1));
        return ret;
    }
}
This class is threaded, so it is possible to take advantage of multicore processors.

A sample test class is shown here.
/*
Author: Zobayer Hasan
*/
public class FIMtest {

    public static void main(String[] args) {
        Database db;
        try {
            db = new Database("mushroom.dat");
        } catch(Exception e) {
            e.printStackTrace();
            return; // cannot continue without a database
        }
        
        System.out.println("\nStarting Apriori");
        
        Apriori test1 = new Apriori("test1", db, 40.0);
        Apriori.debugger = true;
        test1.start();
        try {
            test1.join();
            test1.printPatterns();
        } catch(Exception e) {
            e.printStackTrace();
        }
    }
}

So, this is it: an efficient implementation of the Apriori algorithm in Java. There may be shortcomings or mistakes in the source code or the analysis; if you find any, please let me know. Here is a sample run on the mushroom dataset with 40% minimum support.

Have fun...

Thursday, December 13, 2012

Apriori in Java (Part 1)


Introduction:

Apriori is a very basic and straightforward algorithm for frequent pattern mining. I will not discuss the approach in much depth, as it can be studied from the many lectures and books available on the net; instead, I will present an efficient Java implementation of the standard Apriori algorithm. The target input files are taken from the Frequent Itemset Mining Dataset Repository, for example, the mushroom dataset. Due to its space and runtime complexity, Apriori is not suitable for larger files with around 50,000 records or more, and it may also take a very long time on very dense files.

The Database:

First we need to read the files and store them in a data structure that is both easy to access and efficient. Here I present my Java source code, followed by an analysis explaining why it works well.
/*
Author: Zobayer Hasan
*/

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class Entry {
    public Integer first;
    public Integer second;
    Entry() {}
    Entry(Integer first, Integer second) {
        this.first = first;
        this.second = second;
    }
}

public class Database {
    public static boolean debugger = false;
    
    private final List< List< Integer > > transactions;
    private final List< Integer > items;

    public Database(String dataFileName) throws Exception {
        if(debugger) {
            System.out.println("Processing " + dataFileName);
        }
        
        transactions = new ArrayList< List< Integer > >();
        items = new ArrayList< Integer >();
        
        FileInputStream fin = new FileInputStream(dataFileName);
        InputStreamReader istream = new InputStreamReader(fin);
        BufferedReader stdin = new BufferedReader(istream);
        
        String line;
        
        double startTime = System.currentTimeMillis();
        
        while((line = stdin.readLine()) != null) {
            line = line.trim();
            if(line.isEmpty()) continue; // skip blank lines; Integer.parseInt would fail on ""
            
            List< Integer > transaction = new ArrayList< Integer >();
            for(String num : line.split("\\s+")) {
                transaction.add(Integer.parseInt(num));
            }
            
            
            Collections.sort(transaction);
            transactions.add(transaction);
        }
        
        fin.close();
        istream.close();
        stdin.close();
        
        // k-way merge of the sorted transactions via a priority queue,
        // collecting the distinct items in ascending order
        int n = transactions.size();
        int[] header = new int[n];
        PriorityQueue< Entry > pQ = new PriorityQueue< Entry >(n, new Comparator< Entry >() {
            public int compare(Entry item1, Entry item2) {
                if(item1.first.equals(item2.first)) {
                    return item1.second.compareTo(item2.second);
                } else {
                    return item1.first.compareTo(item2.first);
                }
            }
        });
        
        for(int i = 0; i < n; i++) {
            header[i] = 0;
            pQ.add(new Entry(transactions.get(i).get(header[i]), i));
        }
        
        while(!pQ.isEmpty()) {
            Entry peek = pQ.remove();
            int val = peek.first;
            int idx = peek.second;
            if(items.isEmpty() || items.get(items.size()-1) < val) {
                items.add(val);
            }
            while(header[idx] < transactions.get(idx).size() && transactions.get(idx).get(header[idx]) <= val) {
                header[idx]++;
            }
            if(header[idx] < transactions.get(idx).size()) {
                pQ.add(new Entry(transactions.get(idx).get(header[idx]), idx));
            }
        }
        
        double endTime = System.currentTimeMillis();
        System.out.println("Database created in " + (endTime - startTime)/1000.0 + " seconds");
    }
    
    // count the transactions that contain every item of the given itemset,
    // binary-searching each item within the sorted row
    public int scanDatabase(List< Integer > transaction) {
        int count = 0;
        for(List< Integer > row : transactions) {
            boolean found = true;
            for(Integer item : transaction) {
                int idx, stp, st = 0, en = row.size(), cnt = en - st;
                while(cnt > 0) {
                    stp = cnt >> 1; idx = st + stp;
                    if(row.get(idx).compareTo(item) < 0) {
                        st = ++idx;
                        cnt -= stp+1;
                    }
                    else {
                        cnt = stp;
                    }
                }
                if(st == row.size() || row.get(st).compareTo(item) != 0) {
                    found = false;
                    break;
                }
            }
            if(found) count++;
        }
        return count;
    }
    
    public List< Integer > getItemset() {
        return items;
    }
    
    public int dbSize() {
        return transactions.size();
    }
    
    public List< Integer > getRow(int row) {
        return transactions.get(row);
    }
}
Clearly it provides an interface to access the database, where the constructor must be called with a file name, for example 'mushroom.dat'. The Collections.sort call in the constructor is not necessary if you know the transactions in the file are already sorted in ascending order. All the data files in this repository are already sorted, so the sorting can be disabled.

Now, if we look at the constructor, what is this huge block of code actually doing? First it reads each transaction as a string, then parses it and inserts the transaction as a sorted list. Next, we need a list of the unique items. This could be done by sorting all the records together and then eliminating duplicates, but it can be done better with the help of a priority queue in O(NK log N) time, where we have N transactions and each transaction has K records on average. This works much better than the naive approach, which would take O(NK log NK) because of the extra sorting overhead.
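The merge can be sketched on its own. This standalone class (names are mine) performs the same heap-based k-way merge over toy transactions, mirroring the constructor above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class MergeDemo {
    // Collect the sorted list of distinct items from N sorted transactions
    // using a heap of (value, rowIndex) pairs: O(NK log N) overall.
    static List<Integer> uniqueItems(List<List<Integer>> rows) {
        int n = rows.size();
        int[] head = new int[n];  // next unread position in each row
        PriorityQueue<int[]> pq = new PriorityQueue<int[]>(n,
            (x, y) -> x[0] != y[0] ? Integer.compare(x[0], y[0]) : Integer.compare(x[1], y[1]));
        for (int i = 0; i < n; i++)
            if (!rows.get(i).isEmpty()) pq.add(new int[]{rows.get(i).get(0), i});
        List<Integer> items = new ArrayList<Integer>();
        while (!pq.isEmpty()) {
            int[] top = pq.remove();
            int val = top[0], idx = top[1];
            // only record a value the first time it surfaces
            if (items.isEmpty() || items.get(items.size() - 1) < val) items.add(val);
            // advance this row past val, then push its next item
            while (head[idx] < rows.get(idx).size() && rows.get(idx).get(head[idx]) <= val) head[idx]++;
            if (head[idx] < rows.get(idx).size()) pq.add(new int[]{rows.get(idx).get(head[idx]), idx});
        }
        return items;
    }

    public static void main(String[] args) {
        List<List<Integer>> rows = Arrays.asList(
            Arrays.asList(1, 3, 4), Arrays.asList(2, 3, 5), Arrays.asList(1, 2, 5));
        System.out.println(uniqueItems(rows));  // [1, 2, 3, 4, 5]
    }
}
```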

Then comes the last important part, the scanDatabase() method, which searches each record in the database for a given set of items. It takes O(NK log K) time at most, where we have N records, each with an average length of K. This is much better than the O(N*K^2) of a naive linear scan. Some improvement is still possible: if we knew the range of the item values, we could keep a bit vector alongside each transaction, and the query could then be performed in O(NK). However, as the datasets are huge, I didn't go for it because of the memory it would require.
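The counting scheme can be illustrated standalone. This sketch (names are mine, and it uses Collections.binarySearch in place of the hand-rolled lower-bound search above) counts the rows containing a query itemset:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ScanDemo {
    // Count the rows that contain every item of the query, binary-searching
    // each item within the sorted row: O(N * K log K) in total.
    static int support(List<List<Integer>> rows, List<Integer> query) {
        int count = 0;
        for (List<Integer> row : rows) {
            boolean found = true;
            for (Integer item : query) {
                if (Collections.binarySearch(row, item) < 0) {  // item absent from this row
                    found = false;
                    break;
                }
            }
            if (found) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<Integer>> rows = Arrays.asList(
            Arrays.asList(1, 2, 3), Arrays.asList(2, 3, 5), Arrays.asList(1, 3, 5));
        System.out.println(support(rows, Arrays.asList(2, 3)));  // 2
        System.out.println(support(rows, Arrays.asList(1, 5)));  // 1
    }
}
```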

Part 2 contains the source code and explanation for the apriori algorithm.

Saturday, October 13, 2012

Static Routing In GNS3


This is actually our first lab using GNS3. The previous post contains some information on how to open and save a project along with its configurations. In the previous post, we ran RIP on the routers, which computed the routing table entries automatically. Here I will describe our first experiment that requires manual routing table entries.

Simply open a new project and create a simple topology like the one shown below. (Click on any image to enlarge)


Here we have 4 routers, where R1 and R4 use only one interface and R2 and R3 use two. First we need to configure each interface in use. So, after starting all the routers, we open their terminals and configure the interfaces as we did in the previous post. Configuring R1:


Configuring R2:


Configuring R3:


Configuring R4:


Now you can use the show ip route command to see the routing table on each router. You will see that the routing tables are incomplete. In fact, if you try to ping 192.168.3.1 from R1, it won't work, because ping packets need a round-trip route. So, on router R1, we need to add routing table entries for the networks 192.168.2.0 and 192.168.3.0. Similarly, router R2 only needs information for network 192.168.3.0, as the other two networks are directly connected to it. R3 likewise only needs information for network 192.168.1.0, while R4 needs information for networks 192.168.1.0 and 192.168.2.0. We will now add this information. In config mode, we can use the ip route command to add a routing table entry. The next screenshots show the entries for each router.
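For example, on R1 the two entries could look like the following. The syntax is ip route followed by the destination network, its mask, and the next hop; the next-hop address 192.168.1.2 is my assumption for R2's near interface based on the topology described above, so adjust it to your own addressing:

```
R1#conf t
R1(config)#ip route 192.168.2.0 255.255.255.0 192.168.1.2
R1(config)#ip route 192.168.3.0 255.255.255.0 192.168.1.2
R1(config)#end
```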

For router R1:


For router R2:


For router R3:


For router R4:


Now try to ping from each router to each interface, and it works :D Also use show ip route on the routers to see the updated routing table entries we have just added. Have fun experimenting!

Simple RIP Routing In GNS3


This is a very basic lab in GNS3 where we will create a simple ring topology of routers and ethernet switches and then run the RIP routing protocol on the routers. When we were experimenting in the network simulation lab, we found it very hard to find simple GNS3 examples showing the basics, so I am going to post them here as we go through the lab experiments. We will be using Cisco c2600 routers; it is possible to do a similar experiment with any other router for which you have an IOS image.

Designing a topology is very easy in GNS3: just drag and drop from the left-side dock. Make sure to open an empty project, and in the new blank project dialogue, make sure both checkboxes are ticked. Also, make sure the idle-pc values of the routers are set; if not, right click on a router and select idle-pc. Here is a snapshot of the network we will be using. (Click on any image to enlarge)


Here each router has two interfaces, namely f0/0 and f0/1. We need to set up IP addresses on each interface of all 5 routers. Here I will show the configuration commands for router R1. Make sure all the routers are started; to get a console, right click on a router and select console.

R1>enable
R1#conf t
R1(config)#int f0/0
R1(config-if)#ip add 192.168.5.2 255.255.255.0
R1(config-if)#no sh
R1(config-if)#exit
R1(config)#int f0/1
R1(config-if)#ip add 192.168.1.1 255.255.255.0
R1(config-if)#no sh
R1(config-if)#end
R1#copy running-config startup-config

Here is a screenshot


Some explanation of the short forms: conf t stands for configure terminal, which simply says that we will configure this router from the terminal (there are other ways to do so). Then we specify the interface we wish to configure, for example int f0/0 for interface f0/0, and add its IP address and subnet mask. Next is no sh, which stands for no shutdown; it means this interface will stay "up". Later, we can bring it down with a shutdown command.

So after configuring all the interfaces, we save the project and copy the running configuration to the startup configuration file, so that if we close or stop the project, we can load the settings directly without configuring everything manually again, which is a great hassle.

In the same way, we configure all the interfaces on all the routers. This is the first step of this lab. Now we need to tell the routers that we are going to run RIP on them.

To do so, go into configure mode again (by giving conf t command).
R1(config)#router rip
R1(config-router)#network 192.168.1.0
R1(config-router)#network 192.168.5.0
R1(config-router)#end

As we can see, every router is associated with two networks, so you just provide those two network addresses. As before, we copy the running configuration to the startup configuration once again. After configuring all the routers, we can check their IP routing tables to see whether all the networks have been discovered by RIP. Snapshot of this step:


Now that all the paths have been discovered, we can test the network by pinging all the IPs to ensure that every node is reachable from every other. For example:


The final requirement was to test how RIP responds to a change in the network. For example, on router R1, we can see that it has two ways to reach network 192.168.3.0. What happens if we disable interface f0/1? R1 should no longer be able to forward packets via hop 192.168.1.2. So we enter configuration mode again, select interface f0/1 on router R1, and give the "shutdown" command, which stops interface f0/1. Then we wait 2-3 seconds (on slow machines, we may need to wait longer, even minutes). Now if you check the routing table on R1 with the show ip route command, you will see that network 192.168.3.0 now has only one route. A snapshot is given below; compare it with what you got the last time you ran show ip route. You can enable the interface again with the "no shutdown" command.
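Written out in full, the sequence described above would be (interface name as in our topology):

```
R1#conf t
R1(config)#int f0/1
R1(config-if)#shutdown
R1(config-if)#end
```

Giving no shutdown at the same prompt brings the interface back up.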



So, these are the simple commands we used to complete this task. Have fun experimenting!

Tuesday, May 17, 2011

Writing My Own Shell


Hello everyone, I am back after a long pause. I was busy with nothing in particular for the past few months and felt the need to write something here, so I am going to note down something I did recently. I just finished writing my own shell program for Linux. It is simple and doesn't support piping or redirection, but most of the other features are there, so it actually looks like the common "sh" shell. I am not going to describe the big code; you have to understand what it is doing :p

The source code. It's free to use. I have written the whole code by myself, and surely I am not an expert, so if anything happens to your computer, I don't know you :p

To compile, you can use this makefile in su mode (note: the tab character is important).

Some features, you can invoke single commands like this:

$ doit <command [arguments...]>

Also, you can start the shell by simply calling it without any argument:

$ doit

Then it will behave like a shell, and you can give commands one by one, just like in a simple shell. It implements its own "cd", "pwd" and "jobs" commands (you should know what they mean). You can run background processes by appending & at the end of a command, and so on. You can also display resource utilization. Just dig through the code; I will enhance it more if I get time :)