Release of Robin v1.0 - a Suite of Multimodal Models

The Robin team is proud to present Robin, a suite of Multimodal (Visual-Language) Models. These models outperform, or perform on par with, state-of-the-art models of similar scale.

In the ever-evolving realm of artificial intelligence, the intersection of language understanding and visual perception has paved the way for groundbreaking multimodal models. We study different components and methods for merging pretrained vision and language models, with the goal of building better visual language models.

As part of this first milestone, we release a LLaVA fork that enables the Mistral-7B and OpenHermes-2.5 language models to process images. We combine pretrained LLMs (Vicuna, Mistral, and OpenHermes 2.5) with vision models (CLIP and SigLIP), and further enhance capabilities by finetuning the vision encoder.

Models, code, and a related paper on ethical multimodal systems

LLaVA Architecture Overview

The LLaVA architecture, an acronym for Large Language and Vision Assistant, represents a multimodal Visual Language Model (VLM). At its core, LLaVA integrates a pretrained language model with a pretrained vision encoder, connected through a projection layer. In its original incarnation, the Vicuna model served as the language foundation, while OpenAI's CLIP ViT-Large served as the vision encoder.
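The wiring described above can be sketched in a few lines. This is a minimal illustration, not the actual Robin implementation: all names and dimensions (`VISION_DIM`, `LLM_DIM`, `N_PATCHES`, `W_proj`) are assumptions chosen to mirror typical CLIP ViT-Large and Mistral-7B shapes.

```python
# Minimal sketch of a LLaVA-style architecture: a projection layer
# maps vision-encoder patch features into the LLM's embedding space.
# All names and sizes here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

VISION_DIM = 1024   # e.g. CLIP ViT-Large patch-feature width
LLM_DIM = 4096      # e.g. Mistral-7B hidden size
N_PATCHES = 576     # e.g. 24x24 patches for a 336px input image

# The projection layer: a linear map from vision space to the LLM's
# token-embedding space (later LLaVA versions use a small MLP here).
W_proj = rng.normal(0, 0.02, size=(VISION_DIM, LLM_DIM))

def project_image_features(patch_features: np.ndarray) -> np.ndarray:
    """Map vision-encoder patch features into LLM embedding space."""
    return patch_features @ W_proj

# Pretend the pretrained vision encoder produced these patch features...
image_patches = rng.normal(size=(N_PATCHES, VISION_DIM))
image_tokens = project_image_features(image_patches)

# ...and the tokenizer/embedding layer produced these text embeddings.
text_tokens = rng.normal(size=(32, LLM_DIM))

# The LLM then consumes image tokens and text tokens as one sequence.
llm_input = np.concatenate([image_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (608, 4096)
```

The key design point is that the vision encoder and LLM are both pretrained and initially untouched; only the small projection layer learns to translate between the two representation spaces.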

Building upon this foundation, as part of the first milestone we study the impact of different language models and vision encoders, and the effect of finetuning the vision encoder, on the performance of our multimodal model. Notably, we experiment with combining various versions of Mistral AI's LLMs with Google DeepMind's SigLIP vision encoder.

Architecture Variations

Our model variations are best encapsulated in the table below, outlining the diverse combinations of language models, vision encoders and the fine-tuning strategy.

* these are the original LLaVA models

More models coming soon!

Model Training

Our experiments use the same multimodal instruction-following dataset that was used to train the original LLaVA.

The training process unfolded in two distinct phases, emphasizing the interplay between language understanding and visual perception:

1. Pretraining: only the projection layer is trained, aligning the vision encoder's features with the language model's embedding space, while both pretrained models stay frozen.
2. Finetuning: the model is trained end to end on the multimodal instruction-following data; in some variants, the vision encoder is also unfrozen and finetuned.
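The two-phase schedule can be sketched as a simple freeze/unfreeze policy. This is an illustrative outline only; the module names and the toy `model` structure are assumptions, not Robin's training code.

```python
# Illustrative sketch of a two-phase LLaVA-style training schedule.
# Module names and the toy model structure are assumptions.

model = {
    "vision_encoder": {"trainable": False},
    "projector":      {"trainable": False},
    "llm":            {"trainable": False},
}

def set_trainable(module: str, flag: bool) -> None:
    model[module]["trainable"] = flag

# Phase 1: feature alignment -- train only the projection layer,
# keeping the pretrained vision encoder and LLM frozen.
set_trainable("projector", True)
phase1 = [m for m, cfg in model.items() if cfg["trainable"]]

# Phase 2: instruction finetuning -- unfreeze the LLM (and, in the
# variants that finetune the vision encoder, that module as well).
set_trainable("llm", True)
set_trainable("vision_encoder", True)
phase2 = [m for m, cfg in model.items() if cfg["trainable"]]

print(phase1)  # ['projector']
print(phase2)  # ['vision_encoder', 'projector', 'llm']
```

In a real framework this corresponds to toggling `requires_grad` on each submodule's parameters before building the optimizer for each phase.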
Let's delve into the analysis of the results obtained from our diverse set of models, examining their performance on GQA (a compositional visual question answering benchmark) and on ScienceQA (SQA), in both its text and image settings.

GQA Performance

GQA assesses the model's ability to answer questions related to images, requiring a deep understanding of both textual queries and visual context.

SQA Performance

SQA (ScienceQA) evaluates the models' ability to comprehend and answer science questions, in both text-only and image-based contexts.

More benchmark evaluations consisting of both Language and Vision-Language benchmarks are coming soon!


* More detailed and grounded descriptions

* Fewer hallucinations

* Better reasoning


As part of the first milestone of the Robin effort, we studied the impact of different language models and vision encoders. We found that finetuning the vision encoder further enhanced capabilities and improved performance. All the models and code have been released above. Future models with multi-image support, enhanced visual reasoning, video processing capabilities, and licenses permitting commercial usage are coming soon!


We would like to thank Hessian AI for providing us with free access to 8-16 A100 GPUs for a few weeks, and Florian and Patrick at Hessian AI for their support. We would also like to thank the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility. Preliminary experiments were conducted under an INCITE compute grant on the Summit supercomputer, supported under Contract DE-AC05-00OR22725. This grant was awarded to the AAI CERC lab for their Scalable Foundation Models for Transferable Generalist AI project. This work was done in collaboration with representatives from EleutherAI.