Project leader:
Prof.dr. A. Siebes (UU)
Consortium:
UU, CWI
Industrial partners (non-exhaustive):
Data Distilleries, Kiminkii
Total FTE: 3.10 (heads: faculty: 7, PhD: 2)
Key BRICKS publications:
- A. Siebes et al.: "Item Sets that Compress". In: SDM 2006
- M. van Leeuwen et al.: "Compression Picks Item Sets That Matter". In: ICDM Workshops 2006
- R. Bathoorn et al.: "Reducing the Frequent Pattern Set". In: Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, USA, January 2007
- A. Knobbe et al.: "Maximally informative k-itemsets and their efficient discovery". In: KDD 2006
- A. Knobbe et al.: "Pattern Teams". In: PKDD 2006
- S. Idreos et al.: "Database cracking". In: CIDR 2007
Project IS2: The Petabyte Data Mining Challenge
The goal of this project is to develop database management
techniques and data-mining algorithms to mine heterogeneous
distributed scientific databases. To answer many of the grand
science questions, such as the discovery of pathways, we need to
analyze such huge and diverse data collections as easily as if
they were stored in one small scientific database on a desktop. The
scientist responsible for generating this avalanche of
experimental data should not be bothered by the many technical
problems posed by its persistent storage for subsequent recall, its
compression, or its replication to guard against system failures.
Likewise, the user of the scientific data should not need to know
that the data itself may be globally distributed and that the
processing requests are handled by hundreds of PCs, nor be aware of
the cost-sharing structure of the underlying data-grid. Instead,
the scientist should be able to concentrate on mining the database
for clues on, e.g., bio-diversity, genome structures, or the
evolution of diseases. That is, one has to mine heterogeneous
distributed scientific databases. To allow users to do this
seamlessly, the underlying DBMSs have to support it, and the
techniques from (multi-)relational data-mining and distributed
data-mining have to be combined.
The underlying database technology for this project is MonetDB,
developed in cooperation with CWI. To scale to the required sizes,
we emphasize a new approach of distributed database processing
based on cracking. This novel technique shifts the cost of data
maintenance to query time, such that it can better adapt to the
workload. This approach stems from early experience with the
data-mining tools developed together with Data Distilleries,
where significant overlap in large query sequences revealed the
need to adaptively index only the areas of interest.
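To make the idea of shifting maintenance cost to query time concrete, here is a minimal sketch of cracking on a single in-memory column. It is an illustration under simplifying assumptions, not the MonetDB implementation: the class name CrackedColumn, its methods, and the list-based cracker index are all hypothetical. Each range query partitions only the piece of the column that its bounds fall into, so the column becomes progressively more ordered in exactly the regions that queries touch.

```python
import bisect


class CrackedColumn:
    """Toy cracked column (hypothetical name): a copy of the values that
    is physically reorganised a little more by every range query."""

    def __init__(self, values):
        self.data = list(values)  # cracker copy of the column
        self.pivots = []          # sorted pivot values seen so far
        self.positions = []       # positions[i]: index of first value >= pivots[i]

    def _crack(self, pivot):
        """Return p with data[:p] < pivot <= data[p:], partitioning
        only the single piece that contains the pivot."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]  # already cracked on this value
        lo = self.positions[i - 1] if i > 0 else 0
        hi = self.positions[i] if i < len(self.positions) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.data[lo:hi] = smaller + larger
        pos = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.positions.insert(i, pos)
        return pos

    def range_query(self, low, high):
        """Return all values v with low <= v < high (in no particular order)."""
        start = self._crack(low)
        end = self._crack(high)
        return self.data[start:end]


column = CrackedColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3])
first = column.range_query(5, 13)   # cracks the whole column twice
second = column.range_query(8, 10)  # only touches the piece between 5 and 13
```

In this sketch the first query pays for partitioning the full column, while the second only reorganises the three values already known to lie between 5 and 13. No index is built for regions that are never queried.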
Industrial cooperation
Some of the developed techniques have already been implemented in
Safari, the data-mining tool of Kiminkii, one of our industrial
partners. Even more important for the research is the collaboration
with life scientists, both from UMC-U and from the department of
biology of the UU.
International cooperation
Currently there is collaboration with the University of Antwerp to
merge and compare our approaches. In the future this will be
extended to include the universities of Leuven and Helsinki.
Highlights 2004-2006
Research highlights
One of the main challenges for this type of data mining is the
explosion in the number of patterns that occurs at multiple steps
in the mining process. We have devised various methods that
dramatically reduce this number while preserving the
information present in the complete set of patterns.
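The effect of such a reduction can be illustrated with a small sketch. Note that it does not reproduce the compression-based methods of the publications listed above; it uses closed item sets, a standard lossless reduction, as a stand-in. All function names and the toy transactions are assumptions made for this illustration. An item set is closed if no proper superset has the same support, and the support of every frequent item set can be recovered from the closed ones.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute force: map each frequent item set to its support count."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(1 for t in transactions if cset <= t)
            if support >= min_support:
                result[cset] = support
    return result


def closed_itemsets(frequent):
    """Keep only item sets that have no proper superset of equal support."""
    return {
        s: sup
        for s, sup in frequent.items()
        if not any(s < other and sup == other_sup
                   for other, other_sup in frequent.items())
    }


def support_from_closed(itemset, closed):
    """Recover the support of a frequent item set from the closed sets:
    it equals the largest support among its closed supersets."""
    return max(sup for s, sup in closed.items() if itemset <= s)


# Toy data (hypothetical): four transactions over three items.
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c"}]
frequent = frequent_itemsets(transactions, min_support=2)
closed = closed_itemsets(frequent)
```

On this toy data, seven frequent item sets reduce to three closed ones, and support_from_closed returns the original support for each of the seven. The brute-force enumeration is exponential in the number of items and is only meant to keep the example short.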
Economic & societal impact
Experiments are underway to test the approach on real biological
data sets. The first results indicate that the approach discovers
genuine biological knowledge.
Future work 2007-2009
The reduction in the number of patterns is just the first step in
the development. The next step is to exploit these patterns in a
multi-relational setting. Case studies on biological data sets will
be one of the main sources of inspiration for these developments.
Given that much of the data will reside on servers hosted by other
organisations, one cannot assume that there are resources available
to mine every table as deeply as one would want. An interesting
further question is therefore: can we determine beforehand how much
extra information an associated table would add to our current goal
and, similarly, how much information we can extract without
draining their resources? To realise a viable experimentation
platform for petabyte data-mining, the MonetDB system is being
extended in projects IS-1 and IS-5.
IS2 Researchers funded by BRICKS
- Prof.dr. A.P.J.M. Siebes (UU)
- Drs. H. Philippi (UU)
- Drs. A. Koopman (UU)
- Prof.dr. M. Kersten (CWI)
- Dr. N. Nes (CWI)
- Drs. K.S. Mullender (CWI)
- Drs. F. Groffen (CWI)
For more information, please refer to the publications and posters of this project.