Knowledge Discovery and Data Mining I (WS 2018/19)

News

[22.03.19] The exam inspection for the second exam will take place on Fri, 29.03.19, 10:00 - 11:30 in Oettingenstr. 67, room 157.
[14.03.19] The distribution for the second exam is as follows:

A-L: M 218 (HGB, Geschw.-Scholl-Pl. 1)
M-Z: A 240 (HGB, Geschw.-Scholl-Pl. 1)
Compensation for disadvantage (Nachteilsausgleich): 157 (Oettingenstraße 67)

[18.02.19] Please read the guidelines for the exam. The distribution is as follows:

A-E: A 140 (HGB, Geschw.-Scholl-Pl. 1)
F-M: B 101 (HGB, Geschw.-Scholl-Pl. 1)
N-Z: B 201 (HGB, Geschw.-Scholl-Pl. 1)
Compensation for disadvantage (Nachteilsausgleich): 157 (Oettingenstraße 67)

[02.02.19] We have uploaded the full slide set in a single PDF here
[07.02.19] Exercise 7.2 (OPTICS) contained a mistake: the core-dist of the target node was used for the reach-dist instead of that of the source node. We have fixed this and updated the solution.
[01.02.19] Please send us your questions for the QA-session in advance via e-mail to berrendorf@dbs.ifi.lmu.de before 04.02.19, 12:00, so we have some time to prepare them.
[28.01.19] The date for the second exam has been fixed: Mon, 18.03.19, 16:00 - 18:00 in M 218 and A 240 (HGB, Geschw.-Scholl-Pl. 1).
[28.01.19] The exam inspection for the first exam will take place on Tue, 12.03.19, 10:00 - 11:30 in Oettingenstr. 67, room 157.
[13.12.18] If you are eligible for a compensation for disadvantage (Nachteilsausgleich) for the exam, such as a writing time extension (Schreibzeitverlängerung), please inform us in advance, by 11.02.2019 at the latest.
[07.11.18] Please note: The lecture is being recorded. You can find the videos here.
[02.11.18] We have edited the slide template, such that the slide numbers are slightly shifted. References to slide numbers have been updated accordingly.
[25.10.18] We have uploaded an additional programming exercise on data exploration. Please note: The programming exercises will not be presented in live coding sessions, but there will be time in the tutorials for you to ask and discuss questions.
[24.10.18] There are no tutorials on 01.11.18 and 02.11.18 due to All Saints' Day.
[15.10.18] The registration for tutorials is open. Please register for a tutorial you are likely to attend, so we get an overview of the number of participants.
[15.10.18] The registration for the exam is open. Please register in UniWorX.
[12.10.18] We have uploaded a Python introduction including solutions for it. There will be no live-coding session, but you can ask and discuss questions in the tutorials.
[08.10.18] The date for the first exam has been fixed.
[08.10.18] Please register in UniWorX for this course.

Organization

Workload: 3+2 weekly semester hours (Semesterwochenstunden)
Lecturer: Prof. Dr. Thomas Seidl
Prerequisites: Algorithms and Data Structures (Algorithmen und Datenstrukturen) recommended; Database Systems I (Datenbanksysteme I) helpful.

Registration: via UniWorX
Tutorial coordinators: Julian Busch, Max Berrendorf

Exam:

Event	Time	Location
Main exam	Mon, 25.02.19, 14:00 - 16:00	B 101, B 201, A 140 (HGB, Geschw.-Scholl-Pl. 1)
Retake exam	Mon, 18.03.19, 16:00 - 18:00	M 218, A 240 (HGB, Geschw.-Scholl-Pl. 1)

Dates and Location

Event	Time	Location	Start
Lecture	Tue, 9:15 - 11:45	Room B U101 (Oettingenstr. 67)	16.10.2018
Tutorial 1	Thu, 12:15 - 13:45	Lehrturm-VU107 (Prof.-Huber-Pl. 2)	25.10.2018
Tutorial 2	Thu, 14:15 - 15:45	Lehrturm-VU107 (Prof.-Huber-Pl. 2)	25.10.2018
Tutorial 3	Fri, 12:15 - 13:45	Lehrturm-V005 (Prof.-Huber-Pl. 2)	26.10.2018
Tutorial 4	Fri, 14:15 - 15:45	Room C 111 (Theresienstr. 41)	26.10.2018

Schedule and Material

Lecture		Tutorial
Date	Content	Date	Content	Solutions
16.10.18	1.1 Introduction 2.1 Data Representation (until slide 37)	---	00_python_introduction.ipynb moviemetadata.csv	00_python_introduction_solutions.ipynb 00_python_introduction_solutions.html
23.10.18	2.1 Data Representation (from slide 38) 2.2 Data Reduction (until slide 84)	25.10.18 26.10.18	exercise_01.pdf	solution_01.pdf
30.10.18	2.2 Data Reduction (from slide 85) 2.3 Data Visualization 2.4 Privacy 3.1 Frequent Pattern Mining (until slide 128)	01.11.18 02.11.18	no tutorials (All Saints' Day)
06.11.18	3.1 Frequent Pattern Mining (from slide 129 until 160)	08.11.18 09.11.18	exercise_02.pdf 02_data_exploration.ipynb ChicagoCrime2017.csv.xz	solution_02.pdf 02_data_exploration_solutions.ipynb 02_data_exploration_solutions.html
13.11.18	3.1 Frequent Pattern Mining (from slide 161) 3.2 Clustering (until slide 194)	15.11.18 16.11.18	exercise_03.pdf	solution_03.pdf
20.11.18	3.2 Clustering (from slide 195 until 231)	22.11.18 23.11.18	exercise_04.pdf	solution_04.pdf
27.11.18	3.2 Clustering (from slide 232 until 266)	29.11.18 30.11.18	exercise_05.pdf	solution_05.pdf
04.12.18	(cancelled)	06.12.18 07.12.18	exercise_06.pdf	solution_06.pdf solution_06_slides_dbscan.pdf
11.12.18	3.2 Clustering (from slide 267 until 296)	13.12.18 14.12.18	exercise_07.pdf	solution_07.pdf
18.12.18	3.2 Clustering (from slide 297) 3.3 Outlier Detection	20.12.18 21.12.18	exercise_08.pdf	solution_08.pdf
Christmas break
08.01.19	4.1 Classification (until slide 376)	10.01.19 11.01.19	09_clustering.ipynb large.csv.xz small.csv.xz	09_clustering_solutions.ipynb 09_clustering_solutions.html
15.01.19	4.1 Classification (from slide 377 until 416)	17.01.19 18.01.19	exercise_10.pdf	solution_10.pdf
22.01.19	4.1 Classification (from slide 417 until 453)	24.01.19 25.01.19	exercise_11.pdf	solution_11.pdf
29.01.19	4.1 Classification (from slide 454) 4.2 Regression	31.01.19 01.02.19	exercise_12.pdf	solution_12.pdf
05.02.19	5.1 Process Mining (1) 5.1 Process Mining (2) 5.2 Outlook	07.02.19 08.02.19	13_classification.ipynb 13_classification_linear.html 13_classification_circles.html 13_classification_moons.html

Content

The volume of data in many application areas has grown so much that manual analysis of the accumulated information is becoming increasingly difficult, if not impossible. Especially in areas such as the evaluation of biological measurement methods (gene sequencing, microarray methods, ...) or the transaction data of large telecommunications or network operators, the data cannot be used without computer-aided methods.

The research field "Knowledge Discovery in Databases (KDD)" addresses these problems. It combines aspects of statistics, machine learning, and database systems, and covers methods for the (semi-)automatic extraction of valid, novel, and potentially useful knowledge from large databases. The term data mining, often used in this context, refers to the central step of the KDD process, in which the actual analysis of the data is carried out.

Data mining is also frequently applied to large volumes of business data, which are managed separately in so-called data warehouses. The commonly used term business intelligence then describes, among other things, the application of data mining algorithms to the information provided by a data warehouse in order to support decision-making processes in a targeted way.

The lecture gives an overview of the fundamentals of the most important KDD techniques. It focuses in particular on the following areas: classification, regression/trend detection, clustering, outlier detection, and association rules.

To deepen the lecture material, two-hour tutorials are offered, in which the presented methods are explained further and illustrated with practical examples.
