Release of Robin v1.0 - a Suite of Multimodal Models
The Robin team is proud to present Robin, a suite of multimodal (visual-language) models. These models outperform, or perform on par with, state-of-the-art models of similar scale.
In the ever-evolving realm of artificial intelligence, the intersection of language understanding and visual perception has paved the way for groundbreaking multimodal models. We study different components and methods for merging pretrained vision and language models, with the goal of building better visual-language models.
As part of this first milestone, we release this LLaVA fork, which enables the Mistral-7B and OpenHermes-2.5 language models to process images. We combine pretrained LLMs (Vicuna, Mistral and OpenHermes 2.5) with pretrained vision models (CLIP and SigLIP), and further enhance capabilities by finetuning the vision encoder.
Models, code, and a related paper on ethical multimodal systems
LLaVA Architecture Overview
The LLaVA architecture, short for Large Language and Vision Assistant, is a multimodal visual-language model (VLM). At its core, LLaVA integrates a pretrained language model with a pretrained vision encoder, connected through a projection layer. In its original incarnation, the Vicuna model served as the language foundation, while the CLIP ViT-Large from OpenAI assumed the role of the vision encoder.
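To make this wiring concrete, below is a minimal PyTorch sketch of a LLaVA-style model. The class name, checkpoint identifiers, and the single linear projector are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of a LLaVA-style model. The class name, checkpoint IDs, and the
# single linear projector are illustrative assumptions, not the released code.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

class LlavaStyleVLM(nn.Module):
    def __init__(self,
                 lm_name="mistralai/Mistral-7B-v0.1",
                 vision_name="openai/clip-vit-large-patch14-336"):
        super().__init__()
        self.language_model = AutoModelForCausalLM.from_pretrained(lm_name)
        self.vision_encoder = CLIPVisionModel.from_pretrained(vision_name)
        # The projection layer maps vision features into the LM embedding space.
        self.projector = nn.Linear(
            self.vision_encoder.config.hidden_size,
            self.language_model.config.hidden_size,
        )

    def forward(self, pixel_values, input_ids):
        # Patch features from the vision encoder (dropping the CLS token).
        vision_out = self.vision_encoder(pixel_values=pixel_values)
        image_embeds = self.projector(vision_out.last_hidden_state[:, 1:, :])
        # Prepend projected image tokens to the text token embeddings.
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```

The only new parameters live in the projector; the two pretrained backbones are reused as-is, which is what makes the staged training described below practical.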
Building upon this foundation, as part of this first milestone we study the impact of different language models and vision encoders, and the effect of finetuning the vision encoder, on the performance of our multimodal model. Notably, our journey led us to experiment with combining various versions of the Mistral AI LLM and the DeepMind SigLIP visual encoder.
Architecture Variations
Our model variations are best encapsulated in the table below, outlining the diverse combinations of language models, vision encoders and the fine-tuning strategy.
* these are the original LLaVA models (https://llava-vl.github.io/)
More models coming soon!
Model Training
Our experiments use the same multimodal instruction-following dataset that was used to train the original LLaVA.
The training process unfolded in two distinct phases, emphasizing the interplay between language understanding and visual perception:
Projection Layer Training: In the initial phase, we dedicated our efforts to training the projection layer, holding both the vision and language models in a frozen state.
Finetuning Phase: In the finetuning phase, we unfroze the projection layer and, in most cases, the vision encoder, and adapted the language model using LoRA. For one model, OpenHermes + SigLIP (VE frozen), we instead kept the visual encoder frozen while training the language model and projection layer. Both phases are sketched below.
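For illustration, here is a hedged sketch of how these two phases could be set up, reusing the hypothetical attribute names from the architecture sketch above; the LoRA rank, alpha, and target modules are assumptions rather than the released hyperparameters.

```python
# Hedged sketch of the two training phases. The attribute names reuse the
# illustrative LlavaStyleVLM above; the LoRA rank, alpha and target modules
# are assumptions, not the released hyperparameters.
from peft import LoraConfig, get_peft_model

def phase1_pretrain_projector(model):
    # Phase 1: train only the projection layer; vision and language models stay frozen.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    return model

def phase2_finetune(model, freeze_vision_encoder=False):
    # Phase 2: train the projector (and, in most runs, the vision encoder) in full,
    # while adapting the language model with LoRA adapters.
    lora_cfg = LoraConfig(
        r=128, lora_alpha=256, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    model.language_model = get_peft_model(model.language_model, lora_cfg)
    for p in model.projector.parameters():
        p.requires_grad = True
    for p in model.vision_encoder.parameters():
        p.requires_grad = not freeze_vision_encoder
    return model
```

Passing freeze_vision_encoder=True corresponds to the OpenHermes + SigLIP (VE frozen) variant, where only the projector and the LoRA adapters are updated.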
Results
Let's delve into the results obtained from our diverse set of models, examining their performance on the GQA (visual question answering), ScienceQA (SQA) Text, and SQA Image benchmarks.
GQA Performance
GQA assesses the model's ability to answer questions related to images, requiring a deep understanding of both textual queries and visual context.
Top Performer: The "Vicuna + CLIP" model stands out as the top performer on GQA with a score of 62.04. Finetuning the vision encoder gives this model a boost over the baseline LLaVA 7B, demonstrating robust comprehension of visual questions and showing the impact of the training strategy.
SQA Performance
SQA evaluates the models' capabilities in comprehending and responding to queries, both in textual and image-based contexts.
Overall Competence: The "OpenHermes + SigLIP" model emerges as the standout performer across both the SQA Text and SQA Image benchmarks, with scores of 79.56 and 74.22, respectively. Its strong grasp of both textual and visual information highlights the success of combining OpenHermes-2.5 (a Mistral-7B-v0.1 finetune) with the SigLIP visual encoder.
Impact of finetuning the Visual Encoder: As on GQA, finetuning the vision encoder gives the model a performance boost over the baseline LLaVA 7B model.
Impact of the base foundation models: Using better pretrained components significantly improves performance.
More benchmark evaluations consisting of both Language and Vision-Language benchmarks are coming soon!
Examples
More detailed and grounded descriptions
Fewer hallucinations
Better reasoning
Conclusions
As part of the first milestone of the Robin effort, we studied the impact of different language models and vision encoders. We found that finetuning the vision encoder further enhances capabilities and improves performance. All models and code have been released above. Future models with multi-image support, enhanced visual reasoning, video processing capabilities, and licenses permitting commercial use are coming soon!
Acknowledgements
We would like to thank Hessian-AI for providing us with free access to 8-16 A100 GPUs for a few weeks, and Florian and Patrick at Hessian AI for their support. We would also like to thank the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility. Preliminary experiments were conducted under an INCITE compute grant on the Summit supercomputer, supported under Contract DE-AC05-00OR22725. This grant was awarded to the AAI CERC lab for their Scalable Foundation Models for Transferrable Generalist AI project. This work was done in collaboration with representatives from EleutherAI.