Abstract
Data mining provides methods that help to gain insight into a data set automatically. One of its open problems is selecting a small set of useful patterns from the huge collection of patterns that can be found in a data set. This thesis presents our results in this area.
We show that such a small set of patterns, if well-chosen, allows one to answer queries on the data set without referring to the data itself. Moreover, we show how these pattern sets allow one to build quick and scalable recommender systems. To choose such a small set of patterns, we rely on the Minimum Description Length (MDL) principle: the best model compresses the data best. More precisely, we use the code tables that the heuristic Krimp algorithm induces from the data. Our results show that these code tables are highly characteristic of the data set. Anything one wants to know about the data can be inferred from its code table. In more detail, we show how such a code table can be used to compute the answer to a query on the data set. These answers are almost always very close to the answers one gets by actually computing the query on the data itself. This similarity is verified experimentally and quantified using an asymmetric dissimilarity score derived from the Normalised Compression Distance. Next, we show how the code tables can be used for the predictive task of tag recommendation. In particular, the proposed algorithms offer a good trade-off between accuracy and time-efficiency; using the full set of patterns yields only slightly better results but requires infeasible amounts of time. In a social networking context, we show how to personalize, and thus improve, our tag recommendations. This is achieved by using user-centred knowledge rather than the collective knowledge used for the general task. For quality and scalability reasons, we use 'social batched personomies', which process queries in batches rather than individually as in the standard personomy approach. In each chapter we provide extensive experimental evaluation to show that our methods perform well on a large variety of data sets. From these experiments one cannot but conclude that code tables are highly characteristic of the data.
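To make the MDL idea above concrete, the following is a minimal sketch of how a code table can be used to encode a transaction database and how the encoded size rewards a well-chosen pattern. It is an illustration under stated assumptions, not the Krimp implementation: the function names (`cover`, `usages`, `encoded_size`) and the toy data are hypothetical, the list order of the code table stands in for Krimp's cover order, and only the size of the data given the code table is computed (the full MDL score would also add the size of the code table itself).

```python
from math import log2

def cover(transaction, code_table):
    """Greedily cover a transaction with itemsets, in code-table order.

    Assumption: the order of `code_table` stands in for Krimp's cover order.
    """
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:      # itemset fits in what is still uncovered
            used.append(itemset)
            remaining -= itemset
    if remaining:
        raise ValueError(f"uncovered items: {sorted(remaining)}")
    return used

def usages(database, code_table):
    """Count how often each code-table itemset is used to cover the database."""
    counts = {itemset: 0 for itemset in code_table}
    for transaction in database:
        for itemset in cover(transaction, code_table):
            counts[itemset] += 1
    return counts

def encoded_size(database, code_table):
    """Size in bits of the database encoded with the code table.

    Each used itemset gets an optimal prefix-code length of
    -log2(usage / total usage); unused itemsets contribute nothing.
    """
    counts = usages(database, code_table)
    total = sum(counts.values())
    return sum(-c * log2(c / total) for c in counts.values() if c > 0)

# Hypothetical toy database of four transactions over items a, b, c.
db = [frozenset("abc"), frozenset("ab"), frozenset("ab"), frozenset("c")]

# Code table with singletons only, and one that also holds the pattern {a, b}.
singletons = [frozenset("a"), frozenset("b"), frozenset("c")]
with_pattern = [frozenset("ab")] + singletons

print(encoded_size(db, singletons))
print(encoded_size(db, with_pattern))
```

In this toy example the code table containing the pattern {a, b} yields the smaller encoded size, which is the sense in which a better model "compresses the data best".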