## Friday, December 21, 2012

### Section 3.3: Riding the Fences

This is just another simple Euler path problem. Given an undirected, unweighted graph that may contain multiple edges between the same pair of nodes, you have to find the lexicographically smallest Euler path.

I don't know what the problem setter was thinking, but he made a big mistake. The problem statement says "Your program must output the path of intersections that, if interpreted as a base 500 number, would have the smallest magnitude." where nodes are numbered from 1 to 500. Clearly, 500 is not a valid base-500 digit. I made several wrong submissions trying foolish tricks to work around this; then I just used a plain lexicographic ordering, and it passed. Pretty funny!

The problem statement guarantees that every input has an Euler path. This makes the solution easier: check whether all degrees are even; if so, start from the smallest-numbered node, otherwise start from the smaller-numbered odd-degree node.

Algorithm for finding the Euler path:
```
graph[u][v] contains the number of edges between u and v

find(node u):
    for each adjacent v of u in lexicographic order:
        while graph[u][v] > 0:
            graph[u][v]--
            graph[v][u]--
            find(v)
    stack.push(u)
```
Now the stack contains the Euler path in reverse order, so just keep popping!
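For concreteness, here is a minimal, runnable Java sketch of the routine above, using an adjacency matrix and an explicit stack. The graph in `main` is a small made-up example with 0-based nodes; the actual problem numbers nodes from 1 to 500 and requires input parsing, which is omitted here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class EulerPath {
    static int n;                 // number of nodes
    static int[][] graph;         // graph[u][v] = number of edges between u and v
    static Deque<Integer> stack = new ArrayDeque<Integer>();

    // DFS in increasing neighbor order; a node is pushed when it is finished,
    // so popping the stack yields the lexicographically smallest Euler path
    static void find(int u) {
        for (int v = 0; v < n; v++) {
            while (graph[u][v] > 0) {
                graph[u][v]--;
                graph[v][u]--;
                find(v);
            }
        }
        stack.push(u);
    }

    public static void main(String[] args) {
        n = 5;
        graph = new int[n][n];
        int[][] edges = {{0, 1}, {0, 2}, {1, 2}, {1, 3}, {1, 4}, {3, 4}};
        for (int[] e : edges) {
            graph[e[0]][e[1]]++;
            graph[e[1]][e[0]]++;
        }
        // all degrees are even here, so start from the smallest node;
        // otherwise start from the smallest odd-degree node
        find(0);
        StringBuilder sb = new StringBuilder();
        while (!stack.isEmpty()) sb.append(stack.pop()).append(' ');
        System.out.println(sb.toString().trim());  // prints 0 1 3 4 1 2 0
    }
}
```

Note that the recursion can get deep on dense multigraphs; for large inputs you may want an iterative version with an explicit edge stack.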

## Wednesday, December 19, 2012

### SPOJ: 348. Expedition

Nice problem! The first observation is: to reach the town (say, point 0), every fuel stop between the current position and the town must also be reachable. The task is to minimize the number of fuel stops, of which there are at most 10000. So we sort the fuel stops by distance and start from the current position. To reach a stop at distance d, we must have fuel f greater than or equal to d. Also, greedily using the largest available fuel amounts always minimizes the number of stops we must make.

So, for each stop, in the order encountered from the start: if we can reach it with the fuel at hand, we push its fuel amount into a priority queue (a max-heap here) for future use. If we cannot reach a particular stop, we keep popping the queue and adding the popped amounts to our current fuel until we can reach it, and then push its amount. This strategy is optimal for reaching any particular stop, because the priority queue always holds the largest fuel amounts seen so far along the path but not yet used.

If at any point the queue becomes empty and the fuel is still insufficient, that stop cannot be reached, and hence it is impossible to reach the town.
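The strategy above can be sketched in Java as follows. This is a minimal illustration, not the full SPOJ I/O: stop positions are given as distances from the town (as in the problem), and the instance in `main` is hand-made for demonstration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

public class Expedition {
    // stops[i] = {distance of stop i from the town, fuel available there};
    // the town is townDist units from the start, initial fuel is initFuel.
    // Returns the minimum number of stops, or -1 if the town is unreachable.
    static int minStops(int[][] stops, int townDist, int initFuel) {
        int[][] byStart = stops.clone();
        // process stops in the order they are encountered from the start
        Arrays.sort(byStart, Comparator.comparingInt((int[] s) -> townDist - s[0]));
        // max-heap of fuel amounts passed but not yet used
        PriorityQueue<Integer> heap = new PriorityQueue<Integer>(Comparator.reverseOrder());
        int fuel = initFuel, used = 0;
        for (int[] s : byStart) {
            int dist = townDist - s[0];            // distance of this stop from the start
            while (fuel < dist && !heap.isEmpty()) { fuel += heap.remove(); used++; }
            if (fuel < dist) return -1;            // this stop, hence the town, is unreachable
            heap.add(s[1]);                        // remember its fuel for later
        }
        while (fuel < townDist && !heap.isEmpty()) { fuel += heap.remove(); used++; }
        return fuel >= townDist ? used : -1;
    }

    public static void main(String[] args) {
        // hand-made instance: town 25 units away, 10 units of fuel initially
        int[][] stops = {{4, 4}, {5, 2}, {11, 5}, {15, 10}};
        System.out.println(minStops(stops, 25, 10));  // prints 2
    }
}
```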

Happy Coding

### SPOJ: 11409. Fibonacci With a Twist

After spending hours on it, I was finally able to solve it. The problem statement is pretty clear: given a recursive function, find its nth term.

f(0) = 0
f(1) = 1
f(n) = f(n-1) + f(n-2) + (n-1)

The given range of n makes it obvious that the solution must be either a logarithmic-time algorithm or a constant-time closed form. However, I don't know the closed form; mathematics was never my thing. I am also not sure whether this recurrence can be solved directly using matrix exponentiation.

Looking at the recurrence, it is clear that, for larger n, the equation can be written in the following form:

f(n) = a_n f(0) + b_n f(1) + c_n

Here, a_n, b_n and c_n are the coefficients for the nth term. We can also discard every term containing f(0), which is actually 0. So, if we continue expanding, we can find a really nice pattern:

f(0) = 0
f(1) = 1
f(2) = f(1) + (2-1) = 1 + (1)
f(3) = f(2) + f(1) + (3-1) = 2 + (1) + (2)
f(4) = f(3) + f(2) + (4-1) = 3 + 2(1) + (2) + (3)

similarly...
f(5) = 5 + 3(1) + 2(2) + (3) + (4)
f(6) = 8 + 5(1) + 3(2) + 2(3) + (4) + (5)
f(7) = 13 + 8(1) + 5(2) + 3(3) + 2(4) + (5) + (6)
f(8) = 21 + 13(1) + 8(2) + 5(3) + 3(4) + 2(5) + (6) + (7)

I think some Fibonacci numbers have caught your eye already. So now we try to generalize. Just to note: we can do this because the coefficients of similar terms are consecutive Fibonacci numbers.
f(n) = fib(n) + fib(n-1)*1 + fib(n-2)*2 + fib(n-3)*3 + ... ... + fib(1)*(n-1)
f(n+1) = fib(n+1) + fib(n)*1 + fib(n-1)*2 + fib(n-2)*3 + ... ... + fib(1) * n

subtracting f(n) from f(n+1), we can get:
f(n+1)-f(n) = fib(n+1) + {fib(n-1) + ... ... + fib(1)}
f(n+1)-f(n) = fib(n+1) + fib(n+1) - 1 [since fib(1) + fib(2) + ... + fib(m) = fib(m+2) - 1]
f(n+1) = f(n) + 2fib(n+1) - 1

This can be solved directly using matrix exponentiation; however, that needs a 4x4 matrix. Instead, we can reduce the equation to a non-recursive version:

f(n+1) = f(n) + 2fib(n+1) - 1
f(n+1) = f(n-1) + 2fib(n) + 2fib(n+1) - 2
f(n+1) = f(n-2) + 2fib(n-1) + 2fib(n) + 2fib(n+1) - 3
f(n+1) = f(0) + 2fib(1) + 2fib(2) + 2fib(3) + ... ... + 2fib(n+1) - (n+1)
f(n+1) = 2(fib(n+3) - 1) - (n+1)

So, rewriting for f(n):
f(n) = 2(fib(n+2) - 1) - n

Now this is no longer a hard recursive function; we just need the (n+2)th Fibonacci term, which can be computed using only a 2x2 matrix. This post shows you how, if you need to know.

However, for this specific problem, be careful with the final result; you may need modulus correction for the subtractions.
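Putting it together, here is a minimal Java sketch: fib(n+2) via 2x2 matrix exponentiation, then f(n) = 2(fib(n+2) - 1) - n with modulus correction. The modulus 1000000007 is only an assumption for illustration; use whatever the actual problem statement specifies.

```java
public class FibTwist {
    static final long MOD = 1000000007L;  // assumed modulus, not taken from the problem

    // 2x2 matrix product modulo MOD
    static long[][] mul(long[][] a, long[][] b) {
        long[][] c = new long[2][2];
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                for (int k = 0; k < 2; k++)
                    c[i][j] = (c[i][j] + a[i][k] * b[k][j]) % MOD;
        return c;
    }

    // fib(n) mod MOD, using [[1,1],[1,0]]^n = [[fib(n+1),fib(n)],[fib(n),fib(n-1)]]
    static long fib(long n) {
        long[][] res = {{1, 0}, {0, 1}};   // identity
        long[][] base = {{1, 1}, {1, 0}};
        while (n > 0) {
            if ((n & 1) == 1) res = mul(res, base);
            base = mul(base, base);
            n >>= 1;
        }
        return res[0][1];
    }

    // f(n) = 2*(fib(n+2) - 1) - n, keeping every subtraction non-negative mod MOD
    static long f(long n) {
        long ans = 2 * ((fib(n + 2) - 1 + MOD) % MOD) % MOD;
        return ((ans - n % MOD) % MOD + MOD) % MOD;
    }

    public static void main(String[] args) {
        // matches the small values expanded above
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i <= 8; i++) sb.append(f(i)).append(' ');
        System.out.println(sb.toString().trim());  // prints 0 1 2 5 10 19 34 59 100
    }
}
```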

## Friday, December 14, 2012

(continued from Part 1)

### Apriori Algorithm:

The algorithm works as follows: first, it generates all frequent itemsets of length 1 with respect to the given threshold (minimum support). Then it continues to generate itemsets of lengths 2, 3, ..., n as long as possible. Many improvements and prunings are possible in the implementation.

Do we need to sort frequently? The first observation is: if the items are taken in sorted order at step k = 1, all frequent patterns generated later will also be in sorted order, if maintained properly. This eliminates the need for repeated sorting, or for storing itemsets in logarithmic data structures like map or set; instead, we can store them in an ArrayList- or vector-like structure.

How do we generate itemsets of length k+1? By merging two itemsets of length k. If two itemsets I1 and I2 share a common prefix of length k-1, and I1[k] < I2[k], we can take I1[1 ... k-1], I1[k], I2[k], which is an itemset of length k+1. As our itemsets are sorted, and the generated frequent itemsets are also sorted, this can be done in O(N*K^2). The naive approach would take O(N*K^3); however, by pre-computing the length of the common prefix of consecutive itemsets in O(N*K), we can use it to perform the join described above and to terminate early rather than scanning all N itemsets. The approach is demonstrated in the source code.

What are the prunings? The most important observation about the Apriori algorithm is: if an itemset is not frequent, none of its supersets can be frequent. Equivalently, if an itemset is frequent, all of its subsets must be frequent. The first rule is enforced automatically, since each step builds only on the frequent itemsets of the previous step. How is the second rule checked? We don't really need to check all subsets; for step k it suffices to check the subsets of length k-1, which can be done easily with manual hashing or a Java HashMap. This saves a lot of hassle.

```java
/*
Author: Zobayer Hasan
*/
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class Apriori extends Thread {
    public static boolean debugger = false;

    private final Database db;
    private final List<Integer> itemset;
    private final List<List<Integer>> frequent;
    private final double minsup;

    public Apriori(String thrName, Database db, double minsup) {
        super(thrName);
        this.db = db;
        itemset = db.getItemset();
        frequent = new ArrayList<List<Integer>>();
        this.minsup = minsup;
    }

    @Override
    public void run() {
        double startTime = System.currentTimeMillis();

        int k = 1, n = db.dbSize();
        List<List<Integer>> Ck = new ArrayList<List<Integer>>();
        List<List<Integer>> Lk = new ArrayList<List<Integer>>();
        HashMap<List<Integer>, Integer> seenK = new HashMap<List<Integer>, Integer>();

        // initial candidates: all itemsets of length 1
        for(Integer item : itemset) {
            List<Integer> temp = new ArrayList<Integer>();
            temp.add(item);
            Lk.add(temp);
        }

        while(k <= n && !Lk.isEmpty()) {
            if(debugger) {
                System.out.println("Step " + k);
                System.out.println("Lk: " + Lk);
            }

            // support counting: keep only the candidates meeting minimum support
            seenK.clear();
            Ck.clear();
            for(List<Integer> kth : Lk) {
                int count = db.scanDatabase(kth);
                if((double)count < Math.ceil(minsup * (double)n / 100.0)) continue;
                Ck.add(kth);
            }

            if(debugger) {
                System.out.println("Ck: " + Ck);
            }

            if(Ck.isEmpty()) break;

            for(List<Integer> freq : Ck) {
                frequent.add(freq);
                seenK.put(freq, k);
            }

            // prefixlen[i] = common prefix length of consecutive itemsets Ck[i-1], Ck[i]
            int[] prefixlen = new int[Ck.size()];
            prefixlen[0] = 0;
            for(int i = 1; i < Ck.size(); i++) {
                prefixlen[i] = prefixLen(Ck.get(i-1), Ck.get(i));
            }

            // join step: merge pairs sharing a common prefix of length k-1
            List<List<Integer>> temp = new ArrayList<List<Integer>>();
            for(int i = 0; i < Ck.size(); i++) {
                for(int j = i + 1; j < Ck.size(); j++) {
                    if(prefixlen[j] == k-1) {
                        if(debugger) {
                            System.out.println("Joining: " + i + ":" + Ck.get(i) + " + " + j + ":" + Ck.get(j) + " Prefix Length " + prefixlen[j]);
                        }
                        temp.add(prefixJoin(Ck.get(i), Ck.get(j)));
                    } else {
                        break;
                    }
                }
            }

            if(debugger) {
                System.out.println("Temporary: " + temp);
            }

            // prune step: every subset of length k must be a known frequent itemset
            Lk.clear();
            for(List<Integer> list : temp) {
                boolean candid = true;
                if(k > 1) {
                    for(int i = 0; i < list.size(); i++) {
                        List<Integer> prev = new ArrayList<Integer>();
                        for(int j = 0; j < list.size(); j++) {
                            if(i != j) prev.add(list.get(j));
                        }
                        if(!seenK.containsKey(prev)) {
                            candid = false;
                            break;
                        }
                    }
                }
                if(candid) {
                    Lk.add(list);
                }
            }

            if(debugger) {
                System.out.println("Pruned: " + Lk);
            }

            k++;
        }

        double endTime = System.currentTimeMillis();
        System.out.println("Apriori completed in " + (endTime - startTime)/1000.0 + " seconds");
    }

    public void printPatterns() {
        System.out.println("Frequent Itemsets");
        for(List<Integer> pattern : frequent) {
            System.out.println(pattern);
        }
        System.out.println("Total " + frequent.size() + " itemsets");
    }

    // length of the common prefix of two sorted itemsets
    private int prefixLen(List<Integer> left, List<Integer> right) {
        int len;
        for(len = 0; len < left.size() && len < right.size(); len++) {
            if(left.get(len).compareTo(right.get(len)) != 0) return len;
        }
        return len;
    }

    // join two itemsets sharing a common prefix: all of left plus the last element of right
    private List<Integer> prefixJoin(List<Integer> left, List<Integer> right) {
        List<Integer> ret = new ArrayList<Integer>();
        for(Integer i : left) {
            ret.add(i);
        }
        ret.add(right.get(right.size() - 1));
        return ret;
    }
}
```
This class extends Thread, so it is possible to take advantage of multicore processors.

A sample test class is shown here.
```java
/*
Author: Zobayer Hasan
*/
public class FIMtest {

    public static void main(String[] args) {
        Database db = null;
        try {
            db = new Database("mushroom.dat");
        } catch(Exception e) {
            e.printStackTrace();
        }

        System.out.println("\nStarting Apriori");

        Apriori test1 = new Apriori("test1", db, 40.0);
        Apriori.debugger = true;
        test1.start();
        try {
            test1.join();
            test1.printPatterns();
        } catch(Exception e) {
            e.printStackTrace();
        }
    }
}
```

So this is it: an efficient implementation of the Apriori algorithm in Java. There may be shortcomings or mistakes in the source code or the analysis; if you find any, please let me know. The test class above performs a sample run on the mushroom dataset with 40% minimum support.

Have fun...

## Thursday, December 13, 2012

### Introduction:

Apriori is a very basic and straightforward algorithm for frequent pattern mining. I will not discuss the approach itself much, as it can already be studied from the many lectures and books available online. I will basically present an implementation of mine, an efficient implementation of the standard Apriori algorithm in Java. The target input files are taken from the Frequent Itemset Mining Dataset Repository, for example the mushroom dataset. Due to its space and runtime complexity, Apriori is not suitable for larger files with 50,000 records or so, and it may take a huge amount of time on very dense files.

### The Database:

First we need to read the file and store it in a data structure that is easy to access and efficient at the same time. Here I present my Java source code, followed by an analysis of why it works well.
```java
/*
Author: Zobayer Hasan
*/

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class Entry {
    public Integer first;
    public Integer second;
    Entry() {}
    Entry(Integer first, Integer second) {
        this.first = first;
        this.second = second;
    }
}

public class Database {
    public static boolean debugger = false;

    private final List<List<Integer>> transactions;
    private final List<Integer> items;

    public Database(String dataFileName) throws Exception {
        if(debugger) {
            System.out.println("Processing " + dataFileName);
        }

        transactions = new ArrayList<List<Integer>>();
        items = new ArrayList<Integer>();

        FileInputStream fin = new FileInputStream(dataFileName);
        InputStreamReader istream = new InputStreamReader(fin);
        BufferedReader stdin = new BufferedReader(istream);

        String line;

        double startTime = System.currentTimeMillis();

        // read each transaction, parse it, and store it as a sorted list
        while((line = stdin.readLine()) != null) {
            List<Integer> transaction = new ArrayList<Integer>();
            String[] temp = line.split("\\s+");

            for(String num : temp) {
                transaction.add(Integer.parseInt(num));
            }

            if(transaction.isEmpty()) continue;

            Collections.sort(transaction);
            transactions.add(transaction);
        }

        fin.close();
        istream.close();
        stdin.close();

        // k-way merge of the sorted transactions to collect the distinct items
        int n = transactions.size();
        int[] header = new int[n];
        PriorityQueue<Entry> pQ = new PriorityQueue<Entry>(n, new Comparator<Entry>() {
            public int compare(Entry item1, Entry item2) {
                if(item1.first.equals(item2.first)) {
                    return item1.second.compareTo(item2.second);
                } else {
                    return item1.first.compareTo(item2.first);
                }
            }
        });

        for(int i = 0; i < n; i++) {
            header[i] = 0;
            pQ.add(new Entry(transactions.get(i).get(header[i]), i));
        }

        while(!pQ.isEmpty()) {
            Entry peek = pQ.remove();
            int val = peek.first;
            int idx = peek.second;
            if(items.isEmpty() || items.get(items.size()-1) < val) {
                items.add(val);
            }
            while(header[idx] < transactions.get(idx).size() && transactions.get(idx).get(header[idx]) <= val) {
                header[idx]++;
            }
            if(header[idx] < transactions.get(idx).size()) {
                pQ.add(new Entry(transactions.get(idx).get(header[idx]), idx));
            }
        }

        double endTime = System.currentTimeMillis();
        System.out.println("Database created in " + (endTime - startTime)/1000.0 + " seconds");
    }

    // count the transactions containing every item of the given itemset,
    // using a binary search (lower bound) per item in each sorted row
    public int scanDatabase(List<Integer> transaction) {
        int count = 0;
        for(List<Integer> row : transactions) {
            boolean found = true;
            for(Integer item : transaction) {
                int idx, stp, st = 0, en = row.size(), cnt = en - st;
                while(cnt > 0) {
                    stp = cnt >> 1; idx = st + stp;
                    if(row.get(idx).compareTo(item) < 0) {
                        st = ++idx;
                        cnt -= stp+1;
                    } else {
                        cnt = stp;
                    }
                }
                if(st == row.size() || row.get(st).compareTo(item) != 0) {
                    found = false;
                    break;
                }
            }
            if(found) count++;
        }
        return count;
    }

    public List<Integer> getItemset() {
        return items;
    }

    public int dbSize() {
        return transactions.size();
    }

    public List<Integer> getRow(int row) {
        return transactions.get(row);
    }
}
```
Clearly it provides an interface to access the database; the constructor must be called with a file name, for example 'mushroom.dat'. The Collections.sort call in the constructor is not necessary if you know the transactions in the file are already sorted in ascending order. All the data files in this repository are already sorted, so the sorting can be disabled.

Now, if we look at the constructor, what is this long body actually doing? First it reads each transaction as a string, then parses it and stores the transaction as a sorted list. Next, we need a list of the unique items. This could be done by sorting all the records together and then eliminating duplicates; however, it can be done better with a priority queue, in O(NK log N) time, where we have N transactions and each transaction has K items on average. This works much better than the naive approach, which would take O(NK log NK) because of the extra sorting overhead.

Then comes the last important part, the scanDatabase() method, which searches each record of the database for a given set of items. It takes O(NK log K) time at most, where we have N records of average length K. This is much better than the O(N*K^2) naive scan. A further improvement is possible: if we knew the range of item values, we could keep a bit vector alongside each transaction, making each membership test constant time and a full scan O(NK). However, as the datasets are huge, I didn't go for it due to the memory cost.

Part 2 contains the source code and explanation for the apriori algorithm.