Lehr- und Forschungseinheit für Datenbanksysteme



Master Seminar "Recent Developments in Multi-Modal models" (WS 2023/24)


All information about this course is provided exclusively on Moodle.


  • Required: Lecture "Knowledge Discovery in Databases I" and Lecture "Machine Learning", or equivalent.
    Please indicate in the central registration form on Moodle in which semesters you attended these courses.
  • Audience: The seminar is directed toward Master students in Media Informatics, Bioinformatics, and Informatics, as well as Data Science.
  • Registration: central allocation on Moodle

Time and Locations

We will announce further details at the kickoff meeting.

Title Time Location
Kickoff January 25th, 2024, 16:00-17:30 h Oettingenstr. 67, Room 157
Introduction February 8th, 2024, 16:00-17:30 h Oettingenstr. 67, Room 157
Final Presentation March 5th, 2024, 9:00-16:00 h Oettingenstr. 67, Room 169


Our world and our perception of it consist of various modalities, ranging from image, video, sound, and text to the other senses, which together form a complete picture of our surroundings. Similarly, recent trends in deep learning focus on multi-modal data beyond pure vision and NLP models. Recent large multi-modal models such as PaLI [1], ImageBind [3], and Segment Anything [2] combine visual features with text, depth, and 3D features, showing remarkable robustness and scalability across domains. A large model like PaLI-X shows extraordinary abilities to understand images, videos, and documents, achieving state-of-the-art results on OCR and VQA as well as strong few- and zero-shot generalization. Meanwhile, models like Segment Anything can segment thousands of concepts based on user prompts.

This seminar will explore various multi-modal models and recent trends, covering their architectures, training procedures, and the types of modalities they handle. We will also study their advantages and discuss how they can generalize better toward in-context understanding of seen and unseen concepts.

  • [1] Chen, Xi, et al. "PaLI: A Jointly-Scaled Multilingual Language-Image Model." arXiv preprint arXiv:2209.06794 (2022).
  • [2] Kirillov, Alexander, et al. "Segment Anything." arXiv preprint arXiv:2304.02643 (2023).
  • [3] Girdhar, Rohit, et al. "ImageBind: One Embedding Space to Bind Them All." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
  • [4] Ye, Qinghao, et al. "mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality." arXiv preprint arXiv:2304.14178 (2023).

Additional Information: