No lab on this wednesday 30th, sorry. There will be less exercises and more labs in the second half of the course, data mining.
If you have time, start downloading MOA http://moa.cms.waikato.ac.nz/ and playing with it.
Also, I'm at a workshop next week (May 6th-9th), so no theory and no lab.
Ricard
Monday, April 28, 2014
Tuesday, April 22, 2014
Tuesday, April 8, 2014
Saturday, April 5, 2014
Glitches in lab 1, c++ source
I've found a couple of glitches when doing the lab myself and using the c++ routines I pointed to.
- in my system, though int's are 32 bits, the constants RAND_MAX and LONG_PRIME are 16 bits (so at most 2^15-1). This gives far too little randomness for checking large sets of items. I've reposted distrib.h which simulates (badly) a 32 random bit generator. Also, if this happens in your system, you may want to change the lines
hashes[i][0] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
in the genajbj method of count_min_sketch.cpp with, for example,
hashes[i][0] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][0] *= RAND_MAX;
hashes[i][0] += int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] *= RAND_MAX;
hashes[i][1] += int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
- watch out because CountMinSketch::update takes an int, but CountMinSketch::estimate returns an unsigned int. Watch subtractions with unsigneds which may give nonsensical results; use casts appropriately.
Hopefully these things don't show up in other languages.
- in my system, though int's are 32 bits, the constants RAND_MAX and LONG_PRIME are 16 bits (so at most 2^15-1). This gives far too little randomness for checking large sets of items. I've reposted distrib.h which simulates (badly) a 32 random bit generator. Also, if this happens in your system, you may want to change the lines
hashes[i][0] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
in the genajbj method of count_min_sketch.cpp with, for example,
hashes[i][0] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][0] *= RAND_MAX;
hashes[i][0] += int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] *= RAND_MAX;
hashes[i][1] += int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
- watch out because CountMinSketch::update takes an int, but CountMinSketch::estimate returns an unsigned int. Watch subtractions with unsigneds which may give nonsensical results; use casts appropriately.
Hopefully these things don't show up in other languages.
Wednesday, April 2, 2014
Lecture 3 & Lab 1
Lecture:
- I thought about the discussion on the Karp+ algorithm. An update rule that works is:
if x is in K then count[x]++
else {
if |K | = k then
for all a in K do {
count[a]−−;
if count[a] = 0 then delete a from K & count
}
if |K| < k { insert x in K , set count[x] := 1; }
}
Note that x will not be inserted if sketch is full and no empty place appears after discounting. The proof works now. Intuitively, x may not get in this time, but does its job of discounting k symbols; so if it is frequent it will eventually make it into K when an infrequent element gets to 0 count.
- About the lab: for credit, hand me a short report (2 pages max) with your results. No code fragments, no screenshots please. I said within 2 weeks but it's easter break, so let's say for the 23rd.
- I thought about the discussion on the Karp+ algorithm. An update rule that works is:
if x is in K then count[x]++
else {
if |K | = k then
for all a in K do {
count[a]−−;
if count[a] = 0 then delete a from K & count
}
if |K| < k { insert x in K , set count[x] := 1; }
}
Note that x will not be inserted if sketch is full and no empty place appears after discounting. The proof works now. Intuitively, x may not get in this time, but does its job of discounting k symbols; so if it is frequent it will eventually make it into K when an infrequent element gets to 0 count.
- About the lab: for credit, hand me a short report (2 pages max) with your results. No code fragments, no screenshots please. I said within 2 weeks but it's easter break, so let's say for the 23rd.
Subscribe to:
Posts (Atom)