Release of Robin v1.0 - a Suite of Multimodal Models
The Robin team is proud to present Robin, a suite of multimodal (visual-language) models. These models outperform, or perform on par with, state-of-the-art models of similar scale.
In the ever-evolving realm of artificial intelligence, the intersection of language understanding and visual perception has paved the way for groundbreaking multimodal models. We study different components and methods for merging pretrained vision and language models, with the goal of building better visual-language models.
As part of this first milestone, we release this LLaVA fork, which enables the Mistral-7B and OpenHermes-2.5 language models to process images. We combine pretrained LLMs (Vicuna, Mistral and OpenHermes 2.5) with pretrained vision models (CLIP and SigLIP), and further enhance capabilities by finetuning the vision encoder.
Models, code, and a related paper on ethical multimodal systems
LLaVA Architecture Overview
The LLaVA architecture, short for Large Language and Vision Assistant, is a multimodal visual-language model (VLM). At its core, LLaVA integrates a pretrained language model with a pretrained vision encoder, connected through a projection layer. In its original incarnation, the Vicuna model served as the language foundation, while the CLIP ViT-Large from OpenAI assumed the role of the vision encoder.
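To make this wiring concrete, below is a minimal PyTorch sketch of a LLaVA-style model. The class name, checkpoint identifiers, and the single linear projector are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of a LLaVA-style model. The class name, checkpoint IDs, and the
# single linear projector are illustrative assumptions, not the released code.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

class LlavaStyleVLM(nn.Module):
    def __init__(self,
                 lm_name="mistralai/Mistral-7B-v0.1",
                 vision_name="openai/clip-vit-large-patch14-336"):
        super().__init__()
        self.language_model = AutoModelForCausalLM.from_pretrained(lm_name)
        self.vision_encoder = CLIPVisionModel.from_pretrained(vision_name)
        # The projection layer maps vision features into the LM embedding space.
        self.projector = nn.Linear(
            self.vision_encoder.config.hidden_size,
            self.language_model.config.hidden_size,
        )

    def forward(self, pixel_values, input_ids):
        # Patch features from the vision encoder (dropping the CLS token).
        vision_out = self.vision_encoder(pixel_values=pixel_values)
        image_embeds = self.projector(vision_out.last_hidden_state[:, 1:, :])
        # Prepend projected image tokens to the text token embeddings.
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```

The only new parameters live in the projector; the two pretrained backbones are reused as-is, which is what makes the staged training described below practical.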
Building upon this foundation, as part of this first milestone we study the impact of different language models and vision encoders, and the effect of finetuning the vision encoder, on the performance of our multimodal model. Notably, our journey led us to experiment with combining various versions of the Mistral AI LLM and the DeepMind SigLIP visual encoder.
Architecture Variations
Our model variations are best encapsulated in the table below, outlining the diverse combinations of language models, vision encoders and the fine-tuning strategy.
* these are the original LLaVA models (https://llava-vl.github.io/)
More models coming soon!
Model Training
Our experiments use the same multimodal instruction-following dataset that was used to train the original LLaVA.
The training process unfolded in two distinct phases, emphasizing the interplay between language understanding and visual perception:
Projection Layer Training: In the initial phase, we dedicated our efforts to training the projection layer, holding both the vision and language models in a frozen state.
Finetuning Phase: In the finetuning phase, we unfroze the projection layer and, in most cases, the vision encoder, and adapted the language model using LoRA. For one model, OpenHermes + SigLIP (VE frozen), we instead kept the visual encoder frozen while training the language model and projection layer. Both phases are sketched below.
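For illustration, here is a hedged sketch of how these two phases could be set up, reusing the hypothetical attribute names from the architecture sketch above; the LoRA rank, alpha, and target modules are assumptions rather than the released hyperparameters.

```python
# Hedged sketch of the two training phases. The attribute names reuse the
# illustrative LlavaStyleVLM above; the LoRA rank, alpha and target modules
# are assumptions, not the released hyperparameters.
from peft import LoraConfig, get_peft_model

def phase1_pretrain_projector(model):
    # Phase 1: train only the projection layer; vision and language models stay frozen.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    return model

def phase2_finetune(model, freeze_vision_encoder=False):
    # Phase 2: train the projector (and, in most runs, the vision encoder) in full,
    # while adapting the language model with LoRA adapters.
    lora_cfg = LoraConfig(
        r=128, lora_alpha=256, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    model.language_model = get_peft_model(model.language_model, lora_cfg)
    for p in model.projector.parameters():
        p.requires_grad = True
    for p in model.vision_encoder.parameters():
        p.requires_grad = not freeze_vision_encoder
    return model
```

Passing freeze_vision_encoder=True corresponds to the OpenHermes + SigLIP (VE frozen) variant, where only the projector and the LoRA adapters are updated.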
Results
Let's delve into the results obtained from our diverse set of models, examining their performance on the GQA (visual question answering), ScienceQA (SQA) Text, and SQA Image benchmarks.
GQA Performance
GQA assesses the model's ability to answer questions related to images, requiring a deep understanding of both textual queries and visual context.
Top Performer: The "Vicuna + CLIP" model stands out as the top performer on GQA with a score of 62.04. Finetuning the vision encoder gives this model a boost over the baseline LLaVA 7B, demonstrating robust comprehension of visual questions and showing the impact of the training strategy.
SQA Performance
SQA evaluates the models' capabilities in comprehending and responding to queries, both in textual and image-based contexts.
Overall Competence: The "OpenHermes + SigLIP" model emerges as the standout performer across both the SQA Text and SQA Image benchmarks, with scores of 79.56 and 74.22, respectively. Its strong grasp of both textual and visual information highlights the success of combining OpenHermes-2.5 (a Mistral-7B-v0.1 finetune) with the SigLIP visual encoder.
Impact of finetuning the Visual Encoder: As on GQA, finetuning the vision encoder gives the model a performance boost over the baseline LLaVA 7B model.
Impact of the base foundation models: Using better pretrained components significantly improves performance.
More benchmark evaluations consisting of both Language and Vision-Language benchmarks are coming soon!
Examples
More detailed and grounded descriptions
Fewer hallucinations
Better reasoning
Conclusions
As part of the first milestone of the Robin effort, we studied the impact of different language models and vision encoders. We found that finetuning the vision encoder further enhances capabilities and improves performance. All models and code have been released above. Future models with multi-image support, enhanced visual reasoning, video processing capabilities, and licenses permitting commercial use are coming soon!
Acknowledgements
We would like to thank Hessian-AI for providing us with free access to 8-16 A100 GPUs for a few weeks, and Florian and Patrick at Hessian AI for their support. We would also like to thank the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility. Preliminary experiments were conducted under an INCITE compute grant on the Summit supercomputer, supported under Contract DE-AC05-00OR22725. This grant was awarded to the AAI CERC lab for their Scalable Foundation Models for Transferrable Generalist AI project. This work was done in collaboration with representatives from EleutherAI.