Master Seminar "Recent Developments in Multi-Modal models" (WS 2023/24)
News
All information about the event is provided exclusively on Moodle.
Organization
- Contact: Prof. Dr. Thomas Seidl, Rajat Koner
- Required: Lecture "Knowledge Discovery in Databases I" and Lecture "Machine Learning", or equivalent.
Please indicate in the central registration form on Moodle in which semester you attended these courses.
- Audience: The seminar is aimed at Master's students in Media Informatics, Bioinformatics, and Informatics, as well as Data Science.
- Registration: central allocation on Moodle
Time and Locations
We will announce further details at the kickoff meeting.
| Title | Time | Location |
|---|---|---|
| Kickoff | January 25th, 2024, 16:00-17:30 h | Oettingenstr. 67, Room 157 |
| Introduction | February 8th, 2024, 16:00-17:30 h | Oettingenstr. 67, Room 157 |
| Final Presentation | March 5th, 2024, 9:00-16:00 h | Oettingenstr. 67, Room 169 |
Content
Our world and our perception of it span many modalities, from images, video, sound, and text to other senses, which together form a complete picture of our surroundings. Recent trends in deep learning likewise focus on multi-modal data, moving beyond pure vision and NLP models. Advances in large multi-modal models such as PaLI [1], Segment Anything [2], and ImageBind [3] couple visual features with text, depth, and 3D features, showing remarkable robustness and scalability across domains. A large model like PaLI-X demonstrates extraordinary abilities in understanding images, videos, and documents, achieving state-of-the-art results on OCR, VQA, and few- and zero-shot generalization. Models like Segment Anything, on the other hand, can segment thousands of visual concepts based on user prompts.
This seminar will explore various multi-modal models and recent trends in their architectures, training procedures, and supported modalities. We will also study their advantages and how they can generalize to seen and unseen concepts through in-context understanding.
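To make the core idea behind models like ImageBind more concrete, the sketch below shows, in miniature, how separate per-modality encoders can be aligned into one shared embedding space with a symmetric InfoNCE contrastive loss. This is an illustrative toy (random linear "encoders", NumPy only), not the actual code or architecture of any cited paper; all names and dimensions here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy linear 'encoder' projecting one modality into the shared space,
    followed by L2 normalization (as in CLIP-style contrastive models)."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def info_nce_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss: matching pairs (i, i) should score highest."""
    logits = (z_a @ z_b.T) / temperature  # cosine similarities of all pairs
    labels = np.arange(len(z_a))
    def xent(l):
        # numerically stable cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()
    # average both directions: modality A -> B and B -> A
    return 0.5 * (xent(logits) + xent(logits.T))

# Two modalities with different feature sizes (e.g. 512-d image features,
# 256-d text features), both projected into a shared 128-d embedding space.
x_img = rng.normal(size=(8, 512))
x_txt = rng.normal(size=(8, 256))
W_img = rng.normal(size=(512, 128))
W_txt = rng.normal(size=(256, 128))

z_img = encode(x_img, W_img)
z_txt = encode(x_txt, W_txt)
print(f"contrastive loss: {info_nce_loss(z_img, z_txt):.3f}")
```

In a real system, training minimizes this loss over paired data (image-text, image-depth, image-audio, ...), so that embeddings of all modalities become directly comparable in the shared space, which is what enables cross-modal retrieval and zero-shot transfer.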
- [1] Chen, Xi, et al. "PaLI: A Jointly-Scaled Multilingual Language-Image Model." *arXiv preprint arXiv:2209.06794* (2022).
- [2] Kirillov, Alexander, et al. "Segment Anything." *arXiv preprint arXiv:2304.02643* (2023).
- [3] Girdhar, Rohit, et al. "ImageBind: One Embedding Space to Bind Them All." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
- [4] Ye, Qinghao, et al. "mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality." *arXiv preprint arXiv:2304.14178* (2023).