Fusing Vision and Language
A Tutorial on Vision-Language Models for Multimodal Content Analysis
Summary
The increasing availability of multimodal data, including images and videos, has led to a surge of interest in multimodal models that combine visual and textual information. This tutorial will provide an in-depth introduction to the latest advances in multimodal models, with a focus on large vision-language models. Through a combination of theoretical explanations, code demonstrations, and hands-on exercises, participants will learn how to apply these models to a range of image and video analysis tasks, including image captioning, visual concept detection, and image retrieval. By the end of the tutorial, attendees will have a solid understanding of the strengths and limitations of these models, enabling them to implement their own multimodal applications.
Program
This tutorial will take half a day (a total of 4 hours) and will be presented in person. The tutorial slides can be found here, and all associated notebooks are listed at the bottom of the page.
- 13:30 - 13:45 Welcome Session
- 13:45 - 15:30 From Language and Vision to Vision-Language Models
- 15:30 - 16:00 Coffee Break
- 16:00 - 17:30 Generative AI and Video Analysis
- 17:30 - 18:00 Discussion and Closing Session
Welcome Session
- Welcome
- Background of the Visual Analytics Research Group
- Resources: Sharing tutorial resources with participants
From Language and Vision to Vision-Language Models
Overview of Natural Language Processing (NLP)
- Recurrent Neural Network (RNN)
- Long Short-Term Memory (LSTM)
- Attention Mechanism & Transformer
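To recap the core operation behind the Transformer, here is a minimal NumPy sketch of scaled dot-product attention; the toy tokens and dimensions are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V  # weighted sum of the values

# Toy example: self-attention over 3 tokens with 4-dimensional embeddings
tokens = np.random.randn(3, 4)
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (3, 4)
```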
Overview of Computer Vision (CV)
- Convolutional Neural Network (CNN)
- Vision Transformer (ViT)
“Fusion” of Text and Images
- Types of Multimodal Fusion Strategies
- Language-supervised Learning (CLIP: Contrastive Language-Image Pretraining)
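To make the CLIP objective concrete, the following is a minimal PyTorch sketch of the symmetric contrastive (InfoNCE) loss over a batch of matching image-text pairs; the random embeddings stand in for the outputs of the image and text encoders and are not CLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matching image-text pairs."""
    # L2-normalise embeddings so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with caption j
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))  # matching pairs lie on the diagonal

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy batch: 8 image-text pairs with 512-dimensional embeddings
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```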
Demo Session
Classification Tasks
- What happens if the image only shows concepts that are not in the dictionary?
- Experiment with negative prompts alongside positive prompts, e.g., “not a photo of a cat”, “photo of a dog”
- Experiment with synonymous prompts, e.g., “a photo of a cat”, “a photo of a kitty”
- Experiment with prompts with visual attributes, e.g., “a red car”, “a black car”
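As a starting point for the prompt experiments above, here is a minimal sketch of zero-shot classification with the Hugging Face implementation of CLIP; the checkpoint, prompt list, and example image URL are illustrative choices rather than part of the tutorial notebooks.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prompt dictionary to experiment with: synonyms, negations, visual attributes
prompts = [
    "a photo of a cat",
    "a photo of a kitty",
    "not a photo of a cat",
    "a photo of a dog",
]

# Any test image works here; the URL is just an example
image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the image-text similarity scores gives per-prompt probabilities
probs = outputs.logits_per_image.softmax(dim=-1).squeeze()
for prompt, p in zip(prompts, probs):
    print(f"{p:.3f}  {prompt}")
```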
Retrieval Tasks
- Retrieve images based on 3-5 query images
- Are the results meaningful?
- What kind of similarity do you think CLIP is measuring?
- Retrieve images using ~5 simple prompts (“This is a photo of …”); see the retrieval sketch after this list
- How would you rate the results?
- Are the results intuitive and explainable?
- Try to formulate more advanced prompts, e.g., by adding a short description of the concept
- Do the results improve?
- Can these act as a filter for the search results?
- What can we do to improve search results and explainability?
- Check out our demos iART (https://iart.vision) and iPatent (https://service.tib.eu/ipatent)
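For the retrieval exercises, the sketch below ranks a small local image collection against a text query using CLIP embeddings and cosine similarity; the file names and query prompt are placeholders for your own data.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image collection; replace with your own files
image_paths = ["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"]
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    img_inputs = processor(images=images, return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)
    txt_inputs = processor(text=["This is a photo of a sailing boat"],
                           return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt_inputs)

# Cosine similarity between the query and every image, highest first
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.t()).squeeze(0)
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx]:.3f}  {image_paths[idx]}")
```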
Domain Adaptation: Fine-tuning CLIP with LoRA
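One possible starting point for this part is the PEFT library: the sketch below wraps CLIP's attention projections with LoRA adapters so that only the low-rank updates are trained while the original weights stay frozen. The hyperparameters and checkpoint are illustrative defaults, not the tutorial's settings.

```python
from peft import LoraConfig, get_peft_model
from transformers import CLIPModel

# Wrap CLIP's attention projections with low-rank adapters; only the
# adapter weights are trained, the pretrained CLIP weights stay frozen.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections in both towers
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable

# From here, train as usual on (image, text) pairs from the target domain,
# e.g. with the contrastive loss sketched earlier.
```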
Generative AI and Video Analysis
Generative AI
- From Large Language Models (LLMs) to Large Vision-Language Models (LVLMs) / Multimodal LLMs (MLLMs)
- Foundations, use cases, and benchmarks of MLLMs
Demo Session on Large Vision-Language Models (LVLMs)
Links to LVLMs
- Qwen2.5-VL: https://huggingface.co/spaces/Qwen/Qwen2.5-VL-72B-Instruct
- InternVL: https://huggingface.co/spaces/OpenGVLab/InternVL (note: this Space may be temporarily unavailable)
- InstructBLIP: https://huggingface.co/spaces/hysts/InstructBLIP
- GLM-4.5V: https://chat.z.ai/
Jupyter Notebook on Ollama: https://colab.research.google.com/drive/1SuUxHVAewvT-rMJr-dz5LnlEFCDxArV6
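For reference, here is a minimal sketch of querying a locally served vision-language model through the Ollama Python client; the model name (a LLaVA variant) and the image path are assumptions and not necessarily what the notebook uses.

```python
import ollama

# Ask a locally served vision-language model (e.g. a LLaVA variant pulled via
# `ollama pull llava`) to describe an image; model name and path are examples.
response = ollama.chat(
    model="llava",
    messages=[{
        "role": "user",
        "content": "Describe this image in one sentence.",
        "images": ["./example.jpg"],
    }],
)
print(response["message"]["content"])
```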
Tasks
- Pick several tasks from the MMBench benchmark dataset and assess the quality of results!
- Compare multiple LVLMs such as InstructBLIP, Qwen2.5-VL, and InternVL!
- Compare the results for various prompt types!
- What are the advantages and disadvantages of the prompt types?
- Try multiple choice prompts but change the order of the options. What do you notice?
- Try to extract overlaid text from images using LVLMs!
- How would you rate the quality of the output?
- Can you extract equations from lecture videos?
- Try using the overlaid text as additional context for the first task above!
- Try to extract structured information from data visualizations (e.g., bar charts) with LVLMs!
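As a starting point for the multiple-choice experiment, the sketch below asks the same question twice with shuffled answer options via the Ollama client; the model name, image path, and question are illustrative assumptions.

```python
import ollama

# Probe order sensitivity of multiple-choice prompts: the same question is asked
# twice with the answer options shuffled. Model name and image path are examples.
QUESTION = "Which season is shown in the image? Answer with a single letter."
OPTION_SETS = [
    ["A) spring", "B) summer", "C) autumn", "D) winter"],
    ["A) winter", "B) autumn", "C) summer", "D) spring"],
]

for options in OPTION_SETS:
    prompt = QUESTION + "\n" + "\n".join(options)
    response = ollama.chat(
        model="llava",
        messages=[{"role": "user", "content": prompt, "images": ["./example.jpg"]}],
    )
    print(prompt)
    print("->", response["message"]["content"], "\n")
```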
From Images to Videos
- Temporal Pooling
- X-CLIP
- Video Transformers
- Video LLMs
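To illustrate the simplest of these approaches, temporal pooling, the sketch below averages per-frame CLIP embeddings into a single video embedding and scores it against text labels; the frame paths and labels are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A handful of frames sampled from a video (paths are placeholders)
frames = [Image.open(p) for p in ["frame_00.jpg", "frame_10.jpg", "frame_20.jpg"]]
labels = ["a video of someone cooking", "a video of a football match"]

with torch.no_grad():
    frame_emb = model.get_image_features(**processor(images=frames, return_tensors="pt"))
    text_emb = model.get_text_features(
        **processor(text=labels, return_tensors="pt", padding=True))

# Temporal (mean) pooling: average the per-frame embeddings into one video embedding
video_emb = frame_emb.mean(dim=0, keepdim=True)
video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print((video_emb @ text_emb.t()).softmax(dim=-1))  # per-label scores for the clip
```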
Demo Session on TIB AV-Analytics (TIB-AV-A)
Link to TIB AV-Analytics: https://service.tib.eu/tibava
Example videos: https://tib.eu/cloud/s/Jyj3spaGmtHZ32z
Discussion and Closing Session
- Questions from the participants
- Networking
Contact
Eric Müller-Budack
Email: eric.mueller@tib.eu
Eric Müller-Budack leads the Visual Analytics Research Group at TIB – Leibniz Information Centre for Science and Technology. He received his PhD from Leibniz Universität Hannover in 2021. His main research interests include automatic multimedia indexing, multimedia and multimodal information retrieval, and deep learning for multimedia analysis and retrieval.
Sushil Awale
Email: sushil.awale@tib.eu
Sushil Awale is a research associate in the Visual Analytics research group at TIB – Leibniz Information Centre for Science and Technology and a third-year PhD student at Leibniz Universität Hannover. His research interests center on multimodal modeling and information retrieval.