I've found a couple of glitches while doing the lab myself using the C++ routines I pointed to.
- On my system, although ints are 32 bits, the constants RAND_MAX and LONG_PRIME fit in 16 bits (so they are at most 2^15-1). This gives far too little randomness for checking large sets of items. I've reposted distrib.h, which now simulates (badly) a 32-bit random generator. Also, if this happens on your system, you may want to replace the lines
hashes[i][0] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
in the genajbj method of count_min_sketch.cpp with, for example,
hashes[i][0] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][0] *= RAND_MAX;
hashes[i][0] += int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] *= RAND_MAX;
hashes[i][1] += int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
- Watch out: CountMinSketch::update takes an int, but CountMinSketch::estimate returns an unsigned int. Subtracting unsigned values can give nonsensical results (a difference that should be negative wraps around to a huge positive number), so use casts appropriately.
Hopefully these things don't show up in other languages.