Last lecture today, on predictor evaluation, more predictors, and clustering.
No lab finally.
This is about it for the seminar.
Tuesday, May 27, 2014
Tuesday, May 20, 2014
Wednesday May 21st
We'll have a lecture on frequent pattern mining. Slides here right before the class.
No lab today.
No lab today.
Tuesday, May 13, 2014
Monday, April 28, 2014
Next two weeks
No lab on this wednesday 30th, sorry. There will be less exercises and more labs in the second half of the course, data mining.
If you have time, start downloading MOA http://moa.cms.waikato.ac.nz/ and playing with it.
Also, I'm at a workshop next week (May 6th-9th), so no theory and no lab.
Ricard
If you have time, start downloading MOA http://moa.cms.waikato.ac.nz/ and playing with it.
Also, I'm at a workshop next week (May 6th-9th), so no theory and no lab.
Ricard
Tuesday, April 22, 2014
Tuesday, April 8, 2014
Saturday, April 5, 2014
Glitches in lab 1, c++ source
I've found a couple of glitches when doing the lab myself and using the c++ routines I pointed to.
- in my system, though int's are 32 bits, the constants RAND_MAX and LONG_PRIME are 16 bits (so at most 2^15-1). This gives far too little randomness for checking large sets of items. I've reposted distrib.h which simulates (badly) a 32 random bit generator. Also, if this happens in your system, you may want to change the lines
hashes[i][0] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
in the genajbj method of count_min_sketch.cpp with, for example,
hashes[i][0] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][0] *= RAND_MAX;
hashes[i][0] += int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] *= RAND_MAX;
hashes[i][1] += int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
- watch out because CountMinSketch::update takes an int, but CountMinSketch::estimate returns an unsigned int. Watch subtractions with unsigneds which may give nonsensical results; use casts appropriately.
Hopefully these things don't show up in other languages.
- in my system, though int's are 32 bits, the constants RAND_MAX and LONG_PRIME are 16 bits (so at most 2^15-1). This gives far too little randomness for checking large sets of items. I've reposted distrib.h which simulates (badly) a 32 random bit generator. Also, if this happens in your system, you may want to change the lines
hashes[i][0] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
in the genajbj method of count_min_sketch.cpp with, for example,
hashes[i][0] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][0] *= RAND_MAX;
hashes[i][0] += int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] = int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
hashes[i][1] *= RAND_MAX;
hashes[i][1] += int(float(rand())*float(LONG_PRIME)/float(RAND_MAX) + 1);
- watch out because CountMinSketch::update takes an int, but CountMinSketch::estimate returns an unsigned int. Watch subtractions with unsigneds which may give nonsensical results; use casts appropriately.
Hopefully these things don't show up in other languages.
Wednesday, April 2, 2014
Lecture 3 & Lab 1
Lecture:
- I thought about the discussion on the Karp+ algorithm. An update rule that works is:
if x is in K then count[x]++
else {
if |K | = k then
for all a in K do {
count[a]−−;
if count[a] = 0 then delete a from K & count
}
if |K| < k { insert x in K , set count[x] := 1; }
}
Note that x will not be inserted if sketch is full and no empty place appears after discounting. The proof works now. Intuitively, x may not get in this time, but does its job of discounting k symbols; so if it is frequent it will eventually make it into K when an infrequent element gets to 0 count.
- About the lab: for credit, hand me a short report (2 pages max) with your results. No code fragments, no screenshots please. I said within 2 weeks but it's easter break, so let's say for the 23rd.
- I thought about the discussion on the Karp+ algorithm. An update rule that works is:
if x is in K then count[x]++
else {
if |K | = k then
for all a in K do {
count[a]−−;
if count[a] = 0 then delete a from K & count
}
if |K| < k { insert x in K , set count[x] := 1; }
}
Note that x will not be inserted if sketch is full and no empty place appears after discounting. The proof works now. Intuitively, x may not get in this time, but does its job of discounting k symbols; so if it is frequent it will eventually make it into K when an infrequent element gets to 0 count.
- About the lab: for credit, hand me a short report (2 pages max) with your results. No code fragments, no screenshots please. I said within 2 weeks but it's easter break, so let's say for the 23rd.
Monday, March 24, 2014
No lab wednesday 26th
Still no lab this wednesday, Mar 26th. Theory only.
Slides for Lectures 2 and 3 are available. Of course, they may change at any time without warning. The references will be in the reference page soon.
Note: In Lecture 1 I have corrected a couple of typos in the statements and applications of Chernoff-Hoeffding.
Slides for Lectures 2 and 3 are available. Of course, they may change at any time without warning. The references will be in the reference page soon.
Note: In Lecture 1 I have corrected a couple of typos in the statements and applications of Chernoff-Hoeffding.
Wednesday, March 19, 2014
Lecture 1. The Data Stream model. Counting. Probability tools
Today we went through some course logistics and covered the slides for Lecture 1.
There are two exercises in the slides. To get credit, they should be given to me (by any means, in any format) within two weeks, i.e. by the start of the April 2nd class.
All exercises and lab assignments are individual. If you want to solve them collaborating with someone, ask me first.
There are two exercises in the slides. To get credit, they should be given to me (by any means, in any format) within two weeks, i.e. by the start of the April 2nd class.
All exercises and lab assignments are individual. If you want to solve them collaborating with someone, ask me first.
Thursday, March 13, 2014
Welcome to the Spring 2014 (and first) edition of the MIRI Seminar on Data Streams
Logistics
- Instructor: Ricard Gavaldà . Mail: gavalda AT lsi then the UPC domain.
- Start date: Wednesday March 19th, 2014
- End date: Roughly end of May, 2014
- 3 ECTS credits, to be enrolled and claimed from MIRI at the end of the course.
- Time and places:
- Theory and problems: every Wednesday, 10:00 - 12:00, room A4104
- Labs: approx. every two Wednesdays, 14:00 - 16:00, room C6S303
- No lab yet on Wednesday March 19th
- There will be some gaps due to travels or other commitments. Will be announced on time.
- I will post all the materials here, including pretty detailed descriptions of the lab assignments. It should be possible, though harder, to follow the seminar on your own.
- Intended audience & requisites:
- Students enrolled in MIRI, particularly in the Advanced Computing and Data Mining & Business Intelligence specialities.
- But everybody is welcome.
- Some familiarity with probabilistic reasoning and algorithmics is assumed. Some familiarity with machine learning / data mining is helpful, but probably not essential. Some programming may be necessary. I will try to give as much freedom in the choice of programming language but MOA is mostly Java.
- Evaluation: based on a few exercise sets and a few deliverable lab assignments. Promised to be not too heavy. I am open to discussion for alternative evaluation methods.
- Website: http://www.lsi.upc.edu/~gavalda/DataStreamSeminar
Streaming is one of the central aspects of the ``Big Data'' slogan. At a planet scale, we are today generating more data than we can store, and most of it will never be seen by human eyes. It often takes the form of sequences or streams of data items, arriving at high speed, potentially infinite, and evolving over time. The data stream paradigm contrasts with the usual input-compute-output algorithmic paradigm in this sequential nature, and also in the strong computational requirements it imposes: one pass over the data, small memory, small computation time per data item, ability to give answers in real-time and at anytime.
In this seminar we will:
- Describe some of the scenarios where this paradigm is necessary (sensor networks, smart cities, social media, network monitoring, ...).
- Study and experiment with algorithms for computing basic and not so basic queries over data streams.
- Study and experiment with algorithms for mining knowledge from data streams (predictive models, clustering, pattern mining), as an extension of traditional data mining and machine learning.
present the main ideas and where we will go over the solutions of problem sets
distributed in the previous sessions and 2) lab sessions, where we will experiment
or implement some of the methods described; for Part II (data stream mining) the
MOA stream mining framework will be used.
Very preliminary syllabus
Part I: Algorithms for Data Streams
- The Data Stream Model: Scenarios and definition
- Computing statistics on streams
- Matrix and graph sketches
- Detecting change on streams
- Predictive models
- Clustering
- Mining frequent patterns
- Evaluation
See the pages with Slides, Bibliography and Links, and Lab material to the right.
Subscribe to:
Posts (Atom)