Distributed Knowledge Discovery

Ludwig-Maximilians-Universität München
Institut für Informatik
Lehr- und Forschungseinheit für Datenbanksysteme

University of Munich
Institute for Computer Science
Database and Information Systems

Distributed and Parallel Knowledge Discovery

Objective

The goal of Knowledge Discovery in Databases (KDD) is to extract previously unknown information from large data sources. Established KDD methods assume that the data is stored and analysed in one central database. With the increased use of computer networks, there is now a wide variety of distributed data sources, such as sensor networks or corporate networks.
To mine such data sources with the traditional approach, the complete data has to be transferred to a central site for further processing. Unfortunately, this is not always practical or even possible. The necessary bandwidth might not be available or might be too expensive. Transferring critical data might be problematic with respect to data security. The privacy of customers or other people related to the analysed data has to be preserved. Distributed KDD therefore provides solutions for analysing data locally and recombining the local results to gain global knowledge without massive data transfer to a central server.
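As a minimal illustration of analysing locally and recombining the results, the sketch below lets each site reduce its data to three numbers (count, sum, sum of squares) and computes the global mean and variance from those summaries alone. The names `LocalSummary`, `summarize` and `combine` are hypothetical and chosen only for this example; real Distributed KDD methods exchange much richer local models.

```python
from dataclasses import dataclass


@dataclass
class LocalSummary:
    """Hypothetical per-site summary: all that leaves the local site."""
    count: int
    total: float
    total_sq: float


def summarize(values):
    """Run at each local site; the raw values never leave the site."""
    return LocalSummary(len(values), sum(values), sum(v * v for v in values))


def combine(summaries):
    """Run at the central site; returns the global mean and population variance."""
    n = sum(s.count for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    variance = total_sq / n - mean ** 2
    return mean, variance


# Two sites holding different parts of the data (illustrative values).
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]
global_mean, global_variance = combine([summarize(site_a), summarize(site_b)])
```

Only three numbers per site cross the network, regardless of how many records each site holds.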
A topic closely related to Distributed KDD is Parallel KDD. Parallel KDD distributes datasets over multiple processors to speed up computation. Adapting established data mining algorithms for parallel computation is essential when analysing very large datasets for which data mining would take several days on a single processor.

Techniques

Parallel Clustering

To speed up clustering algorithms, this area examines ways to divide a data source, cluster the resulting parts, and recombine the local clusterings to find the global cluster structure within large data sets.
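The divide, cluster and recombine steps can be sketched as follows. This is only an illustration under simplifying assumptions: one-dimensional data, a small hand-written k-means, and a recombination step that clusters the local centroids weighted by their cluster sizes. The function names are hypothetical, and this is not necessarily the approach pursued in the project.

```python
from concurrent.futures import ThreadPoolExecutor


def weighted_kmeans(points, weights, k, iterations=20):
    """Small 1-D weighted k-means with deterministic initialisation."""
    ordered = sorted(points)
    # Start from k evenly spaced values of the sorted input.
    centers = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    mass = [0.0] * k
    for _ in range(iterations):
        sums = [0.0] * k
        mass = [0.0] * k
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: abs(p - centers[c]))
            sums[j] += w * p
            mass[j] += w
        # Keep the old center if a cluster received no points.
        centers = [sums[j] / mass[j] if mass[j] > 0 else centers[j] for j in range(k)]
    return centers, mass


def parallel_kmeans(data, k, n_parts):
    """Divide the data, cluster each part, then recombine the local results.

    Assumes every part is non-empty (n_parts <= len(data)).
    """
    parts = [data[i::n_parts] for i in range(n_parts)]
    # Threads keep the sketch short; a process pool or separate machines
    # would be needed for an actual speed-up on CPU-bound work.
    with ThreadPoolExecutor() as pool:
        local = list(pool.map(lambda part: weighted_kmeans(part, [1.0] * len(part), k), parts))
    local_centers = [c for centers, _ in local for c in centers]
    local_sizes = [m for _, sizes in local for m in sizes]
    global_centers, _ = weighted_kmeans(local_centers, local_sizes, k)
    return sorted(global_centers)


data = [1.0, 1.2, 0.8, 1.1, 0.9, 10.0, 10.2, 9.8, 10.1, 9.9]
centers = parallel_kmeans(data, k=2, n_parts=2)
```

Weighting each local centroid by its cluster size keeps the recombined centers close to what clustering the full data set at once would produce, at least for well-separated clusters like those in this toy input.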

Distributed Clustering

The challenge of distributed clustering is to derive a meaningful global clustering from a variety of local clusterings. Unlike in parallel clustering, the data distribution and the algorithms that produced the local clusterings are given rather than chosen.
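One simple way to picture this setting: every site has already produced its own clustering and reports only a representative (centroid) and a size per local cluster. The central site then links representatives that lie close together and treats each linked group as one global cluster. The sketch below assumes this representation, a hypothetical distance threshold `eps`, and a hypothetical function name `merge_local_models`; it illustrates the problem and does not describe a specific published algorithm.

```python
import math


def merge_local_models(site_models, eps):
    """Merge local cluster representatives into global clusters.

    site_models: one list per site, each holding (centroid, count) pairs,
                 where centroid is a tuple of coordinates.
    eps:         representatives at most this far apart are linked.
    """
    reps = [(c, n) for model in site_models for c, n in model]
    parent = list(range(len(reps)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(reps)):
        for j in range(i + 1, len(reps)):
            if math.dist(reps[i][0], reps[j][0]) <= eps:
                parent[find(i)] = find(j)

    groups = {}
    for i, rep in enumerate(reps):
        groups.setdefault(find(i), []).append(rep)

    merged = []
    for members in groups.values():
        total = sum(n for _, n in members)
        dim = len(members[0][0])
        # Size-weighted mean of the linked representatives.
        centroid = tuple(sum(c[d] * n for c, n in members) / total for d in range(dim))
        merged.append((centroid, total))
    return sorted(merged)


# Local models as they might arrive from two sites (illustrative values).
site_a = [((0.0, 0.0), 10), ((5.0, 5.0), 20)]
site_b = [((0.2, 0.0), 30), ((9.0, 9.0), 5)]
global_clusters = merge_local_models([site_a, site_b], eps=0.5)
```

Here the two representatives near the origin are recognised as the same global cluster, while the remaining local clusters stay separate.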

Privacy Preserving Data Mining

One of the main risks of KDD is the violation of the privacy of persons who are related to the analyzed data. For example, the results might make it possible to draw conclusions about individual persons described in the data source. Privacy Preserving Data Mining aims to develop algorithms that are capable of analyzing personal data without violating the privacy of the persons concerned.
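To give a flavour of what such an algorithm can look like, the sketch below uses randomized response, a classical survey technique that is swapped in here purely as an illustration and is not claimed to be the method studied in this project. Each person reports the true yes/no value only with probability `p_truth` and the opposite value otherwise, so no single report reveals the truth, yet the overall proportion can still be estimated. The function names and parameter values are assumptions made for this example.

```python
import random


def randomize(true_value, p_truth, rng):
    """Report the true value with probability p_truth, otherwise its opposite."""
    return true_value if rng.random() < p_truth else not true_value


def estimate_proportion(reports, p_truth):
    """Estimate the true share of 'yes' values from the randomized reports.

    The observed rate is p_truth * pi + (1 - p_truth) * (1 - pi); solving
    for pi gives the estimator below. Requires p_truth != 0.5.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)


rng = random.Random(0)  # fixed seed so the example is repeatable
true_values = [i < 6000 for i in range(20000)]  # 30% 'yes' (illustrative)
reports = [randomize(v, 0.75, rng) for v in true_values]
estimate = estimate_proportion(reports, 0.75)
```

The estimate is noisy for small samples and becomes more accurate as the number of reports grows; choosing `p_truth` closer to 0.5 gives stronger protection for individuals at the cost of a noisier estimate.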

Project leader

Prof. Dr. Hans-Peter Kriegel

Team

For problems or suggestions, please contact: wwwmaster@dbs.informatik.uni-muenchen.de
Last Modified: 2003-03-25
