Fusing Vision and Language
A Tutorial on Vision-Language Models for Multimodal Content Analysis
Summary
The increasing availability of multimodal data, including images and videos, has led to a surge of interest in multimodal models that combine visual and textual information. This tutorial will provide an in-depth introduction to the latest advances in multimodal models, with a focus on large vision-language models. Through a combination of theoretical explanations, code demonstrations, and hands-on exercises, participants will learn how to apply these models to a range of image and video analysis tasks, including image captioning, visual concept detection, and image retrieval. By the end of the tutorial, attendees will have a solid understanding of the strengths and limitations of these models, enabling them to implement their own multimodal applications.
Program
This tutorial will take half a day (a total of 4 hours) and will be presented in person. The tutorial slides can be found here, and all associated notebooks are listed at the bottom of the page.
- 13:30 - 13:45 Welcome Session
- 13:45 - 15:30 From Language and Vision to Vision-Language Models
- 15:30 - 16:00 Coffee Break
- 16:00 - 17:30 Generative AI and Video Analysis
- 17:30 - 18:00 Discussion and Closing Session
Welcome Session
- Welcome
- Background of the Visual Analytics Research Group
- Resources: Sharing tutorial resources with participants
From Language and Vision to Vision-Language Models
Overview of Natural Language Processing (NLP)
- Recurrent Neural Network (RNN)
- Long Short-Term Memory (LSTM)
- Attention Mechanism & Transformer
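To recap the core operation behind the Transformer, here is a minimal NumPy sketch of scaled dot-product attention; the toy tokens and dimensions are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V  # weighted sum of the values

# Toy example: self-attention over 3 tokens with 4-dimensional embeddings
tokens = np.random.randn(3, 4)
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (3, 4)
```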
Overview of Computer Vision (CV)
- Convolutional Neural Network (CNN)
- Vision Transformer (ViT)
“Fusion” of Text and Images
- Types of Multimodal Fusion Strategies
- Language-supervised Learning (CLIP: Contrastive Language-Image Pretraining)
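To make the CLIP objective concrete, the following is a minimal PyTorch sketch of the symmetric contrastive (InfoNCE) loss over a batch of matching image-text pairs; the random embeddings stand in for the outputs of the image and text encoders and are not CLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matching image-text pairs."""
    # L2-normalise embeddings so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with caption j
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))  # matching pairs lie on the diagonal

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy batch: 8 image-text pairs with 512-dimensional embeddings
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```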
Demo Session
Classification Tasks
- What happens if the image only shows concepts that are not in the dictionary?
- Experiment with negative prompts alongside positive prompts, e.g., “not a photo of a cat”, “photo of a dog”
- Experiment with synonymous prompts, e.g., “a photo of a cat”, “a photo of a kitty”
- Experiment with prompts with visual attributes, e.g., “a red car”, “a black car”
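As a starting point for the prompt experiments above, here is a minimal sketch of zero-shot classification with the Hugging Face implementation of CLIP; the checkpoint, prompt list, and example image URL are illustrative choices rather than part of the tutorial notebooks.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prompt dictionary to experiment with: synonyms, negations, visual attributes
prompts = [
    "a photo of a cat",
    "a photo of a kitty",
    "not a photo of a cat",
    "a photo of a dog",
]

# Any test image works here; the URL is just an example
image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the image-text similarity scores gives per-prompt probabilities
probs = outputs.logits_per_image.softmax(dim=-1).squeeze()
for prompt, p in zip(prompts, probs):
    print(f"{p:.3f}  {prompt}")
```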
Retrieval Tasks
- Retrieve images based on 3-5 query images
- Are the results meaningful?
- What kind of similarity do you think CLIP is measuring?
- Retrieve images using ~5 simple prompts (“This is a photo of …”); see the retrieval sketch after this list
- How would you rate the results?
- Are the results intuitive and explainable?
- Try to formulate more advanced prompts, e.g., by adding a short description of the concept
- Do the results improve?
- Can these act as a filter for the search results?
- What can we do to improve search results and explainability?
- Check out our demos iART (https://iart.vision) and iPatent (https://service.tib.eu/ipatent)
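For the retrieval exercises, the sketch below ranks a small local image collection against a text query using CLIP embeddings and cosine similarity; the file names and query prompt are placeholders for your own data.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image collection; replace with your own files
image_paths = ["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"]
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    img_inputs = processor(images=images, return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)
    txt_inputs = processor(text=["This is a photo of a sailing boat"],
                           return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt_inputs)

# Cosine similarity between the query and every image, highest first
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.t()).squeeze(0)
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx]:.3f}  {image_paths[idx]}")
```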
Domain Adaptation: Fine-tuning CLIP with LoRA
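One possible starting point for this part is the PEFT library: the sketch below wraps CLIP's attention projections with LoRA adapters so that only the low-rank updates are trained while the original weights stay frozen. The hyperparameters and checkpoint are illustrative defaults, not the tutorial's settings.

```python
from peft import LoraConfig, get_peft_model
from transformers import CLIPModel

# Wrap CLIP's attention projections with low-rank adapters; only the
# adapter weights are trained, the pretrained CLIP weights stay frozen.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections in both towers
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable

# From here, train as usual on (image, text) pairs from the target domain,
# e.g. with the contrastive loss sketched earlier.
```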
Generative AI and Video Analysis
Generative AI
- From Large Language Models (LLMs) to Large Vision-Language Models (LVLMs) / Multimodal LLMs (MLLMs)
- Foundations, use cases, and benchmarks of MLLMs
Demo Session on Large Vision-Language Models (LVLMs)
Links to LVLMs
- Qwen2.5-VL: https://huggingface.co/spaces/Qwen/Qwen2.5-VL-72B-Instruct
- InternVL: https://huggingface.co/spaces/OpenGVLab/InternVL (note: this Space may be temporarily unavailable)
- InstructBLIP: https://huggingface.co/spaces/hysts/InstructBLIP
- GLM-4.5V: https://chat.z.ai/
Jupyter Notebook on Ollama: https://colab.research.google.com/drive/1SuUxHVAewvT-rMJr-dz5LnlEFCDxArV6
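For reference, here is a minimal sketch of querying a locally served vision-language model through the Ollama Python client; the model name (a LLaVA variant) and the image path are assumptions and not necessarily what the notebook uses.

```python
import ollama

# Ask a locally served vision-language model (e.g. a LLaVA variant pulled via
# `ollama pull llava`) to describe an image; model name and path are examples.
response = ollama.chat(
    model="llava",
    messages=[{
        "role": "user",
        "content": "Describe this image in one sentence.",
        "images": ["./example.jpg"],
    }],
)
print(response["message"]["content"])
```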
Tasks
- Pick several tasks from the MMBench benchmark dataset and assess the quality of results!
- Compare multiple LVLMs such as InstructBLIP, Qwen2.5-VL, and InternVL!
- Compare the results for various prompt types!
- What are the advantages and disadvantages of the prompt types?
- Try multiple choice prompts but change the order of the options. What do you notice?
- Try to extract overlaid text from images using LVLMs!
- How would you rate the quality of the output?
- Can you extract equations from lecture videos?
- Try using the overlaid text as additional context for the first task above!
- Try to extract structured information from data visualizations (e.g., bar charts) with LVLMs!
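As a starting point for the multiple-choice experiment, the sketch below asks the same question twice with shuffled answer options via the Ollama client; the model name, image path, and question are illustrative assumptions.

```python
import ollama

# Probe order sensitivity of multiple-choice prompts: the same question is asked
# twice with the answer options shuffled. Model name and image path are examples.
QUESTION = "Which season is shown in the image? Answer with a single letter."
OPTION_SETS = [
    ["A) spring", "B) summer", "C) autumn", "D) winter"],
    ["A) winter", "B) autumn", "C) summer", "D) spring"],
]

for options in OPTION_SETS:
    prompt = QUESTION + "\n" + "\n".join(options)
    response = ollama.chat(
        model="llava",
        messages=[{"role": "user", "content": prompt, "images": ["./example.jpg"]}],
    )
    print(prompt)
    print("->", response["message"]["content"], "\n")
```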
From Images to Videos
- Temporal Pooling
- X-CLIP
- Video Transformers
- Video LLMs
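To illustrate the simplest of these approaches, temporal pooling, the sketch below averages per-frame CLIP embeddings into a single video embedding and scores it against text labels; the frame paths and labels are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A handful of frames sampled from a video (paths are placeholders)
frames = [Image.open(p) for p in ["frame_00.jpg", "frame_10.jpg", "frame_20.jpg"]]
labels = ["a video of someone cooking", "a video of a football match"]

with torch.no_grad():
    frame_emb = model.get_image_features(**processor(images=frames, return_tensors="pt"))
    text_emb = model.get_text_features(
        **processor(text=labels, return_tensors="pt", padding=True))

# Temporal (mean) pooling: average the per-frame embeddings into one video embedding
video_emb = frame_emb.mean(dim=0, keepdim=True)
video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print((video_emb @ text_emb.t()).softmax(dim=-1))  # per-label scores for the clip
```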
Demo Session on TIB AV-Analytics (TIB-AV-A)
Link to TIB AV-Analytics: https://service.tib.eu/tibava
Example videos: https://tib.eu/cloud/s/Jyj3spaGmtHZ32z
Discussion and Closing Session
- Questions from the participants
- Networking
Contact
Eric Müller-Budack
Email: eric.mueller@tib.eu
Eric Müller-Budack leads the Visual Analytics Research Group at TIB – Leibniz Information Centre for Science and Technology. He received his PhD from Leibniz Universität Hannover in 2021. His main research interests include automatic multimedia indexing, multimedia and multimodal information retrieval, and deep learning for multimedia analysis and retrieval.
Sushil Awale
Email: sushil.awale@tib.eu
Sushil Awale is a research associate in the Visual Analytics research group at TIB – Leibniz Information Centre for Science and Technology and a third-year PhD student at Leibniz Universität Hannover. His research interests center on multimodal modeling and information retrieval.