I. What is a vision language model?
Vision language models (VLMs) are multimodal AI systems built by combining a large language model (LLM) with a vision encoder, giving the LLM the ability to "see".
With this ability, VLMs can process and understand diverse content across video, image, and text formats, and generate natural-language responses.

Unlike traditional computer vision models, VLMs are not limited to a fixed set of classes or a specific task such as classification or detection. VLMs are trained on huge datasets of image/video-caption pairs. This allows them to be instructed in natural language and to perform not only traditional vision tasks but also generative tasks such as summarizing content or visual question answering (Visual QA).
II. Why are vision language models important?
To understand the importance of vision language models (VLMs), first consider how traditional computer vision (CV) models work. Traditional CV models based on convolutional neural networks (CNNs) are typically trained for one specific task on a fixed set of classes. For example:
- A classification model can determine whether an image contains a cat or a dog.
- An optical character recognition (OCR) model can extract text from an image but cannot understand the layout or the visual structure of data in a document.
Earlier CV models can only perform the tasks they were trained for; they cannot extend to other tasks or recognize new classes without retraining. If the use case changes or a new class must be added, developers have to collect and label a large number of images and then retrain the model, a very expensive and time-consuming process. In addition, traditional CV models have no natural language understanding.
VLMs open a new era by combining the power of foundation models such as CLIP with large language models (LLMs), giving them the ability to both see and understand language. Out of the box, VLMs show impressive zero-shot performance on many vision tasks such as visual question answering, classification, and optical character recognition (OCR). They are also extremely flexible: rather than being limited to a fixed set of classes, they can be applied to almost any use case simply by changing the text prompt.
Using a VLM is similar to working with an LLM: the user enters a text prompt, optionally with an image attached, and the system processes these inputs to generate a text response. Users can ask questions, request summaries, ask for explanations of content, or analyze images in a conversational context. VLMs can also be integrated into visual agents to perform vision-related tasks automatically.
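As a minimal illustration of this prompt-plus-image workflow, the sketch below uses the Hugging Face transformers library with a LLaVA checkpoint; the model name, image URL, and prompt template are illustrative assumptions rather than details from the original article.

```python
# Minimal sketch: prompting a VLM with an image plus a text question.
# Model name, image URL, and prompt template are assumptions.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # hypothetical choice of checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
prompt = "USER: <image>\nWhat animal is in this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```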
III. How do vision language models work?
Most vision language models (VLMs) follow an architecture with three main components (a code sketch follows the list):
- Vision encoder: usually a CLIP-based model with a Transformer architecture, trained on millions of image-text pairs, which gives it the ability to link images to language.
- Projector: converts the output of the vision encoder into a form the LLM can understand, usually image tokens. This component can be as simple as a linear layer, as in LLaVA and VILA, or more complex, such as the cross-attention layers used in Llama 3.2 Vision.
- Large language model (LLM): any LLM can be used to build a VLM. There are currently hundreds of VLM variants, created by combining different LLMs with vision encoders.
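To make this three-part structure concrete, here is a hedged PyTorch sketch; the dimensions and module choices are illustrative assumptions, not any specific model's implementation.

```python
# Illustrative sketch of the three-part VLM architecture in PyTorch.
# All dimensions and module choices are assumptions made for clarity.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=4096, vocab_size=32000):
        super().__init__()
        # 1) Vision encoder: stands in for a CLIP-style Transformer.
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
            num_layers=2)
        # 2) Projector: LLaVA/VILA-style single linear layer.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # 3) LLM: only its embedding table is sketched here.
        self.llm_embed = nn.Embedding(vocab_size, llm_dim)

    def forward(self, image_patches, text_token_ids):
        visual_feats = self.vision_encoder(image_patches)   # (B, patches, vision_dim)
        image_tokens = self.projector(visual_feats)         # (B, patches, llm_dim)
        text_tokens = self.llm_embed(text_token_ids)        # (B, seq, llm_dim)
        # The LLM decoder would consume this fused sequence.
        return torch.cat([image_tokens, text_tokens], dim=1)

model = TinyVLM()
patches = torch.randn(1, 196, 768)         # e.g. 14x14 patches of a 224x224 image
tokens = torch.randint(0, 32000, (1, 16))  # dummy text token ids
fused = model(patches, tokens)             # shape: (1, 196 + 16, 4096)
```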

IV. How are vision language models trained?
VLMs are trained in several stages, starting with pretraining followed by supervised fine-tuning. In addition, parameter-efficient fine-tuning (PEFT) can be applied to create a VLM specialized for a particular field using custom data.
1. Pretraining
The goal of this stage is to align the vision encoder, the projector, and the large language model (LLM) so that they "speak the same language" when processing text and images. This process uses a large corpus of interleaved image-text data. Once these three components are well aligned, the VLM moves on to the supervised fine-tuning stage.
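A common alignment recipe, sketched below under assumptions (recipes vary between real models), is to freeze the pretrained vision encoder and LLM at first and train only the projector on image-text pairs; the code reuses the TinyVLM sketch from the architecture section.

```python
# Hedged sketch of a typical alignment step: freeze the pretrained vision
# encoder and the LLM, and train only the projector on image-text data.
import torch

model = TinyVLM()
for p in model.vision_encoder.parameters():
    p.requires_grad = False   # keep pretrained visual features fixed
for p in model.llm_embed.parameters():
    p.requires_grad = False   # keep the LLM weights fixed during alignment

# Only the projector's parameters are updated in this stage.
optimizer = torch.optim.AdamW(model.projector.parameters(), lr=1e-4)
```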
2. Supervised Fine-Tuning
In this step, the VLM is trained to learn how to respond to user requests. The input data is a collection of sample prompts containing images and/or text, paired with the desired responses. For example, the model may be asked to describe the content of an image or to count the number of objects in a frame. After this stage, the VLM learns to interpret images more accurately and to respond in a way that fits the context.
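As an illustration, a single supervised fine-tuning record might look like the following; the field names are hypothetical, since real datasets use varying schemas.

```python
# Hypothetical SFT training record; field names are illustrative only.
sft_example = {
    "image": "warehouse_frame_0042.jpg",
    "prompt": "How many boxes are on the conveyor belt?",
    "response": "There are three boxes on the conveyor belt.",
}
```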

3. How does a VLM work after training?
After training, a VLM can be used like an LLM: the user enters a prompt, optionally with an image attached, and the model analyzes the input and generates a text response. Typically, VLMs are deployed behind an OpenAI-compatible REST API so they can be easily integrated into applications.
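Calling such an endpoint might look like the sketch below; the base URL, API key, model name, and image URL are placeholders, and the snippet assumes an OpenAI-compatible chat completions API.

```python
# Hedged sketch of querying a VLM behind an OpenAI-compatible REST API.
# Base URL, API key, model name, and image URL are all placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="example-vlm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```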
Currently, more advanced techniques are being studied to improve the image-processing ability of VLMs, including:
- Combining multiple vision encoders for better image analysis.
- Dividing high-resolution images into smaller tiles for more efficient processing.
- Increasing the context length to help VLMs correctly understand long videos.
These advances are pushing VLMs beyond the limits of single-image processing. They can now compare and analyze multiple images at once, read and understand text in images more accurately, process long videos, and show better spatial awareness.
V. How are vision language models evaluated?
Many common benchmarks, such as MMMU, Video-MME, MathVista, ChartQA, and DocVQA, are used to evaluate the performance of vision language models (VLMs) on many different tasks, including:
- Visual question answering
- Logic and reasoning
- Document understanding
- Multi-image comparison
- Video understanding

How the evaluation benchmarks work:
Most of these benchmarks consist of a collection of images, each with several related questions, usually in multiple-choice format. The multiple-choice format makes it easy to evaluate and compare VLM performance consistently. The questions are designed to test the model's perception, understanding, and reasoning abilities.
How a VLM's performance score is calculated:
During evaluation, the VLM receives an input consisting of an image, a question, and candidate answers, and its task is to choose the correct answer. The VLM's accuracy is calculated as the proportion of correctly answered questions out of the total number of multiple-choice questions.
Some benchmarks also include questions that require the model to perform arithmetic, where the result must fall within a permitted tolerance to be counted as correct. The questions and images in these tests are often drawn from academic sources such as university-level textbooks.
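A minimal scoring sketch is shown below; the 1% relative tolerance for arithmetic answers is an illustrative assumption, not any benchmark's actual rule.

```python
# Minimal sketch of benchmark scoring: exact match for multiple-choice
# answers, numeric tolerance for arithmetic answers (tolerance is assumed).
def score(predictions, references, rel_tol=0.01):
    correct = 0
    for pred, ref in zip(predictions, references):
        if isinstance(ref, float):               # arithmetic question
            correct += abs(pred - ref) <= rel_tol * abs(ref)
        else:                                    # multiple-choice question
            correct += pred == ref
    return correct / len(references)

print(score(["B", "C", 3.14], ["B", "A", 3.1416]))  # accuracy = 2/3
```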
VI. Applications of vision language models (VLMs)
VLMs are quickly becoming a leading tool for vision-related tasks thanks to their flexibility and natural language understanding. With nothing more than a text prompt, VLMs can perform many different tasks, such as:
- Visual question answering
- Image and video summarization
- Recognition and analysis of documents, including handwritten ones
Previously, handling these tasks required combining several specialized models. Now, a single VLM can take them all on.
Application in education
VLMs are especially good at summarizing image content. In education, for example, a VLM can recognize an image containing a handwritten math problem, then use OCR and reasoning to understand the problem and provide step-by-step instructions for solving it. Beyond simply reading content, VLMs can also reason about it and carry out specific tasks on request.

Application in video analysis
The amount of video data created every day is so large that reviewing it and extracting information manually is impractical. VLMs can be integrated into video analytics systems to identify important events on request, for example:
- In a warehouse, a VLM can detect a robot collision or raise a warning when shelves topple over.
- In traffic, a smart monitoring system can identify and warn about hazards such as fallen trees, stalled vehicles, or accidents.
A VLM's analysis does not stop at image recognition; it can also analyze footage and generate reports automatically, improving monitoring and management efficiency.
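The sketch below shows one way such a monitoring loop could be wired together; the `ask_vlm` helper is hypothetical and stands in for a call to a VLM endpoint such as the one sketched earlier.

```python
# Hedged sketch of a video analytics loop: sample frames from a video and
# ask a VLM whether an event of interest occurred.
import cv2

def ask_vlm(frame, question):
    """Hypothetical helper: send the frame and question to a VLM endpoint
    (e.g. an OpenAI-compatible API) and return its text answer."""
    raise NotImplementedError

def monitor(video_path, every_n_frames=30):
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % every_n_frames == 0:  # sample roughly once per second
            answer = ask_vlm(frame, "Is there a fallen object or an accident?")
            if "yes" in answer.lower():
                print(f"Alert at frame {frame_idx}: {answer}")
        frame_idx += 1
    cap.release()
```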
Application in long video analysis
VLMs can be combined with graph databases to understand long video content, helping to identify complex events and relationships within a video. This makes them applicable to tasks such as:
- Optimizing warehouse operations (finding bottlenecks, improving operating efficiency)
- Analyzing footage and generating sports commentary automatically
Thanks to these advances, VLMs are effective not only at summarizing image content but also at analyzing, synthesizing, and producing valuable information across many fields.
VII. Challenges of vision language models (VLMs)
Despite rapid progress, VLMs still face some limitations, especially in spatial understanding and long-context video understanding.
1. Limits on input size and the ability to recognize small details
Most VLMs today use CLIP-based models as the vision encoder, which are inherently limited to input sizes of 224×224 or 336×336. This makes it difficult for the model to recognize small or fine-grained details.
For example, a 1080×1920 HD frame from a video must be downscaled or cropped before being fed into the model, losing many important details. To overcome this, tiling methods that split a large image into smaller parts for better processing are being studied. There is also research into using higher-resolution images.
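A minimal tiling sketch is shown below; the 336×336 tile size follows the encoder input sizes mentioned above, and the file name is a placeholder.

```python
# Minimal sketch of tiling: split a high-resolution image into fixed-size
# tiles matching the vision encoder's input size (336x336 assumed here).
from PIL import Image

def tile_image(path, tile_size=336):
    image = Image.open(path)
    width, height = image.size
    tiles = []
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            box = (left, top, min(left + tile_size, width), min(top + tile_size, height))
            tiles.append(image.crop(box))
    return tiles  # each tile is encoded separately before reaching the LLM

tiles = tile_image("hd_frame.jpg")  # a 1080x1920 frame yields 4 x 6 = 24 tiles
```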
2. Limitations in determining the exact position of objects
VLMs have difficulty providing the exact position of an object in an image. The main reason is that CLIP's training dataset mostly contains brief image descriptions (captions) without detailed information about object positions, which weakens the spatial understanding of CLIP and of the VLMs that inherit from it. Currently, some research is attempting to combine multiple vision encoders to improve this ability.
3. Limitations in understanding long videos
Processing long videos is a big challenge, because a VLM needs to consider visual information spanning a long period of time to produce an accurate analysis. Like LLMs, VLMs are limited by context length, meaning only a certain number of frames can be fed into the model to answer a question.
Methods for extending context and training VLMs on richer video data are being studied, such as LongVILA, a model focused on long video processing.
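To see why context length caps the number of frames, here is an illustrative back-of-the-envelope calculation; all token counts are assumptions chosen for the example.

```python
# Illustrative calculation: how many frames fit in the context window.
# All token counts below are assumptions, not measurements of any model.
context_length = 8192     # assumed LLM context window (tokens)
tokens_per_frame = 256    # assumed image tokens produced per frame
prompt_budget = 512       # tokens reserved for the question and the answer

max_frames = (context_length - prompt_budget) // tokens_per_frame
print(max_frames)  # 30 frames -> at 1 frame per second, only ~30 s of video
```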
4. Limitations in specialized fields
VLMs may not have been trained on enough data for very specific use cases, such as detecting manufacturing defects on a particular production line. This can be addressed in several ways:
- Fine-tuning the model on specialized datasets to improve accuracy.
- Using VLMs with in-context learning, providing examples that help the model adapt quickly without retraining.
- Applying parameter-efficient fine-tuning (PEFT) to improve VLM accuracy on custom data (see the sketch after this list).
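The sketch below shows one common PEFT approach, LoRA, using the Hugging Face peft library; the base checkpoint and target module names are assumptions that depend on the model being tuned.

```python
# Hedged sketch of PEFT via LoRA with the Hugging Face peft library.
# Checkpoint and target module names are assumptions; adjust per model.
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

base = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction is trainable
```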
Source: NVIDIA