Basic Research in Informatics for Creating the Knowledge Society
RESEARCH: PROJECTS

Project leader: Prof.dr. A. Siebes (UU)
Consortium: UU, CWI
Industrial partners (non-exhaustive): Data Distilleries, Kiminkii
Total FTE: 3.10 (heads: faculty: 7, PhD: 2)
Key BRICKS publications:
A. Siebes et al: "Item Sets that Compress" In: SDM 2006
M. van Leeuwen et al: "Compression Picks Item Sets That Matter" In: ICDM Workshops 2006
R. Bathoorn et al: "Reducing the Frequent Pattern Set" In: Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, USA, January 2007
A. Knobbe et al: "Maximally informative k-itemsets and their efficient discovery" In: KDD 2006
A. Knobbe et al: "Pattern Teams" In: PKDD 2006
S. Idreos et al: "Database cracking" In: CIDR 2007
Project IS2: The Petabyte Data Mining Challenge
The goal of this project is to develop database management techniques and data-mining algorithms for mining heterogeneous, distributed scientific databases. To answer many of the grand science questions, such as the discovery of pathways, we need to analyze huge and diverse data collections as easily as if they were stored in one small desktop scientific database. The scientist responsible for generating this avalanche of experimental data should not be bothered by the many technical problems posed by persistent storage for subsequent recall, compression, or replication to guard against system failures. Likewise, the user of the scientific data should not need to know that the data may be globally distributed and that processing requests are handled by hundreds of PCs, nor be aware of the cost-sharing structure of the underlying data-grid. Instead, the scientist should be able to concentrate on mining the database for clues on, e.g., bio-diversity, genome structures, or the evolution of diseases. To allow users to do this seamlessly, the underlying DBMSs have to support it, and techniques from (multi-)relational data-mining and distributed data-mining have to be combined.

The underlying database technology for this project is MonetDB, developed in cooperation with CWI. To scale to the required sizes, we emphasize a new approach to distributed database processing based on cracking. This novel technique shifts the cost of data maintenance to query time, so that the system can better adapt to the workload. The approach stems from early experience with the data-mining tools developed together with Data Distilleries, where significant overlap in large query sequences uncovered the need to adaptively index only the areas of interest.
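The idea of shifting maintenance cost to query time can be illustrated with a small sketch. It is a toy, in-memory illustration of the cracking principle only, not MonetDB code: the class CrackedColumn and its methods are hypothetical names chosen for this example. Each range query partitions just the piece of the column it touches, so the column gradually becomes ordered around the value ranges that are actually queried.

```python
import bisect


class CrackedColumn:
    """Toy cracker column (hypothetical, for illustration only).

    Holds a copy of the column values that is physically reorganized,
    piece by piece, as a side effect of the range queries it serves.
    """

    def __init__(self, values):
        self.data = list(values)  # cracker copy, reorganized in place
        self.pivots = []          # sorted pivot values seen so far
        self.bounds = []          # bounds[i]: first position with value >= pivots[i]

    def _crack(self, pivot):
        """Partition one piece so that values < pivot precede values >= pivot.

        Returns the position of the first value >= pivot.
        """
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.bounds[i]  # this pivot was cracked by an earlier query
        # Only the piece between the neighbouring pivots needs to be touched.
        lo = self.bounds[i - 1] if i > 0 else 0
        hi = self.bounds[i] if i < len(self.bounds) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.data[lo:hi] = smaller + larger
        split = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.bounds.insert(i, split)
        return split

    def range_query(self, low, high):
        """Return all values v with low <= v < high (assumes low <= high)."""
        start = self._crack(low)
        end = self._crack(high)
        return self.data[start:end]


column = CrackedColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3, 14, 11, 8, 6])
first = column.range_query(5, 12)   # cracks the column at 5 and at 12
second = column.range_query(10, 15)  # only re-partitions the pieces around 10 and 15
```

Values outside the queried ranges are never sorted, which is the point of the approach: the indexing effort is spent only on the areas of interest. A real system would partition in place and handle updates, both of which this sketch leaves out.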

Industrial cooperation
Some of the developed techniques have already been implemented in Safari, the data-mining tool of Kiminkii, one of our industrial partners. Even more important for the research is the collaboration with life scientists, both from UMC-U and from the department of biology of the UU.

International cooperation
Currently there is collaboration with the University of Antwerp to merge and compare our approaches. In the future this will be extended with the universities of Leuven and Helsinki.

Highlights 2004-2006
Research highlights
One of the main challenges for this type of data mining is the explosion in the number of patterns that occurs at multiple steps in the mining process. We have devised various methods that give a dramatic reduction of this number, while maintaining the information present in the complete set of patterns.
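To make the pattern explosion and the notion of an information-preserving reduction concrete, the sketch below uses closed item sets, a standard condensed representation from the frequent-pattern literature. This is an illustrative stand-in and not the project's own method; the publications listed above describe compression-based selection, which is not reproduced here. All function names and the tiny transaction database are assumptions made for this example, and the enumeration is brute force, so it only suits very small inputs.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration: map each frequent item set to its support count."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(1 for t in transactions if cset <= t)
            if support >= min_support:
                result[cset] = support
    return result


def closed_itemsets(frequent):
    """Keep only item sets that have no proper superset with the same support."""
    return {
        s: sup
        for s, sup in frequent.items()
        if not any(s < t and sup == tsup for t, tsup in frequent.items())
    }


def support_from_closed(itemset, closed):
    """Recover the support of a frequent item set from the closed sets alone."""
    return max(sup for c, sup in closed.items() if itemset <= c)


# Hypothetical toy database of four transactions.
transactions = [
    frozenset("abcd"),
    frozenset("abcd"),
    frozenset("abc"),
    frozenset("ab"),
]
frequent = frequent_itemsets(transactions, min_support=2)
closed = closed_itemsets(frequent)
print(len(frequent), "frequent item sets,", len(closed), "closed item sets")
```

The support of every frequent item set can be recomputed from the closed sets with support_from_closed, so the smaller collection loses no information about the complete set of patterns.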

Economic & societal impact
Experiments are underway to test the approach on real biological data sets. The first results show the discovery of true biological knowledge.

Future work 2007-2009
The reduction in the number of patterns is just the first step in the development. The next step is to exploit these patterns in a multi-relational setting. Case studies on biological data sets will be one of the main sources of inspiration for these developments. Given that much of the data will reside on servers hosted by other organisations, one cannot assume that resources are available to mine every table as deeply as one would want. An interesting further question is therefore: can we determine beforehand how much extra information an associated table would add to our current goal and, similarly, how much information we can extract without draining the host's resources? To realise a viable experimentation platform for petabyte data-mining, the MonetDB system is being extended in projects IS-1 and IS-5.

IS2 Researchers funded by BRICKS

  • Prof.dr. A.P.J.M. Siebes (UU)
  • Drs. H. Philippi (UU)
  • Drs. A. Koopman (UU)
  • Prof.dr. M. Kersten (CWI)
  • Dr. N. Nes (CWI)
  • Drs. K.S. Mullender (CWI)
  • Drs. F. Groffen (CWI)

For more information, please refer to the publications and posters of this project.


© 2004-2009 BRICKS Consortium