Master Seminar "Recent Developments in Multi-Modal models" (WS 2023/24)
News
All information about the event is provided exclusively on Moodle.
Organization
- Contact: Prof. Dr. Thomas Seidl, Rajat Koner
- Required: Lecture "Knowledge Discovery in Databases I" and Lecture "Machine Learning", or equivalent.
Please indicate in the central registration form on Moodle in which semester you attended these courses.
- Audience: The seminar is aimed at Master's students in Media Informatics, Bioinformatics, and Informatics, as well as Data Science.
- Registration: central allocation on Moodle
Time and Locations
We will announce further details at the kickoff meeting.
| Title | Time | Location |
|---|---|---|
| Kickoff | January 25th, 2024, 16:00-17:30 h | Oettingenstr. 67, Room 157 |
| Introduction | February 8th, 2024, 16:00-17:30 h | Oettingenstr. 67, Room 157 |
| Final Presentation | March 5th, 2024, 9:00-16:00 h | Oettingenstr. 67, Room 169 |
Content
Our world and our perception of it span many modalities, from images, video, sound, and text to other senses, which together form a complete picture of our surroundings. Recent trends in deep learning likewise focus on multi-modal data, moving beyond pure vision and NLP models. Advances in large multi-modal models such as PaLI [1], Segment Anything [2], and ImageBind [3] couple visual features with text, depth, and 3D features, showing remarkable robustness and scalability across domains. A large model like PaLI-X demonstrates extraordinary abilities in understanding images, videos, and documents, achieving state-of-the-art results on OCR, VQA, and few- and zero-shot generalization. Models like Segment Anything, on the other hand, can segment thousands of visual concepts based on user prompts.
This seminar will explore various multi-modal models and recent trends in their architectures, training procedures, and supported modalities. We will also study their advantages and how they can generalize to seen and unseen concepts through in-context understanding.
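To make the core idea behind models like ImageBind more concrete, the sketch below shows, in miniature, how separate per-modality encoders can be aligned into one shared embedding space with a symmetric InfoNCE contrastive loss. This is an illustrative toy (random linear "encoders", NumPy only), not the actual code or architecture of any cited paper; all names and dimensions here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy linear 'encoder' projecting one modality into the shared space,
    followed by L2 normalization (as in CLIP-style contrastive models)."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def info_nce_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss: matching pairs (i, i) should score highest."""
    logits = (z_a @ z_b.T) / temperature  # cosine similarities of all pairs
    labels = np.arange(len(z_a))
    def xent(l):
        # numerically stable cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()
    # average both directions: modality A -> B and B -> A
    return 0.5 * (xent(logits) + xent(logits.T))

# Two modalities with different feature sizes (e.g. 512-d image features,
# 256-d text features), both projected into a shared 128-d embedding space.
x_img = rng.normal(size=(8, 512))
x_txt = rng.normal(size=(8, 256))
W_img = rng.normal(size=(512, 128))
W_txt = rng.normal(size=(256, 128))

z_img = encode(x_img, W_img)
z_txt = encode(x_txt, W_txt)
print(f"contrastive loss: {info_nce_loss(z_img, z_txt):.3f}")
```

In a real system, training minimizes this loss over paired data (image-text, image-depth, image-audio, ...), so that embeddings of all modalities become directly comparable in the shared space, which is what enables cross-modal retrieval and zero-shot transfer.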
- [1] Chen, Xi, et al. "PaLI: A Jointly-Scaled Multilingual Language-Image Model." *arXiv preprint arXiv:2209.06794* (2022).
- [2] Kirillov, Alexander, et al. "Segment Anything." *arXiv preprint arXiv:2304.02643* (2023).
- [3] Girdhar, Rohit, et al. "ImageBind: One Embedding Space to Bind Them All." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
- [4] Ye, Qinghao, et al. "mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality." *arXiv preprint arXiv:2304.14178* (2023).