Monday, April 28, 2014

Next two weeks

No lab on this wednesday 30th, sorry. There will be less exercises and more labs in the second half of the course, data mining.

If you have time, start downloading MOA and playing with it.

Also, I'm at a workshop next week (May 6th-9th), so no theory and no lab.


Tuesday, April 22, 2014

Tuesday, April 8, 2014

Saturday, April 5, 2014

Glitches in lab 1, c++ source

I've found a couple of glitches when doing the lab myself and using the c++ routines I pointed to.

- in my system, though int's are 32 bits, the constants RAND_MAX and LONG_PRIME are 16 bits (so at most 2^15-1). This gives far too little randomness for checking large sets of items. I've reposted distrib.h which simulates (badly) a 32 random bit generator. Also, if this happens in your system, you may want to change the lines

  hashes[i][0] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
  hashes[i][1] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);

in the genajbj method of count_min_sketch.cpp with, for example,

  hashes[i][0] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
  hashes[i][0] *= RAND_MAX;
  hashes[i][0] += int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
  hashes[i][1] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
  hashes[i][1] *= RAND_MAX;
  hashes[i][1] += int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);

- watch out because CountMinSketch::update takes an int, but CountMinSketch::estimate returns an unsigned int. Watch subtractions with unsigneds which may give nonsensical results; use casts appropriately.

Hopefully these things don't show up in other languages.

Wednesday, April 2, 2014

Lecture 3 & Lab 1


- I thought about the discussion on the Karp+ algorithm. An update rule that works is:

   if x is in K then count[x]++
   else {
        if |K | = k then
               for all a in K do {
                   if count[a] = 0 then delete a from K & count
        if |K| <  k { insert x in K , set count[x] := 1; }

Note that x will not be inserted if sketch is full and no empty place appears after discounting. The proof works now. Intuitively, x may not get in this time, but does its job of discounting k symbols; so if it is frequent it will eventually make it into K when an infrequent element gets to 0 count.

- About the lab: for credit, hand me a short report (2 pages max) with your results. No code fragments, no screenshots please. I said within 2 weeks but it's easter break, so let's say for the 23rd.