LLaVA-O1: Let Vision Language Models Reason Step-by-Step

lnyan | 176 points

This quote summarizes the main secret sauce to me: once the model generates a wrong token/phrase, the whole answer goes south. It also explains why the whole CoT approach works - prevent the LLM from committing to a wrong answer with two tricks: 1) ask the LLM explicitly to generate intermediate steps instead of jumping straight to a final answer, and 2) use beam search (filtering several candidate answers at each stage) to reduce the risk of picking a wrong one even further.

Quote from the paper: “Moreover, they (VLMs) frequently deviate from a logical reasoning path toward conclusions, instead presenting a conclusion prematurely and subsequently attempting to justify it. Given that language models generate responses token-by-token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path.”
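
For anyone curious, here's a rough Python sketch of that stage-wise generate-and-filter idea as I read it. `generate`, `score`, and the way the stage tags are fed back into the prompt are hypothetical stand-ins, not the paper's actual code:

    STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

    def generate(prompt, n):
        """Placeholder: sample n candidate completions for the current stage."""
        raise NotImplementedError  # stand-in for whatever VLM you call

    def score(prompt, candidate):
        """Placeholder: judge how good a candidate continuation is."""
        raise NotImplementedError  # e.g. a reward model, or the model comparing candidates

    def answer_stagewise(question, n_candidates=4):
        prompt = question
        for stage in STAGES:
            # trick 1: ask for this intermediate stage only, not the final answer
            candidates = generate(f"{prompt}\n<{stage}>", n_candidates)
            # trick 2: filter several candidates so one bad token/phrase
            # doesn't drag the rest of the answer down a flawed path
            best = max(candidates, key=lambda c: score(prompt, c))
            prompt += f"\n<{stage}>{best}</{stage}>"
        return prompt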

yalok | 4 days ago

The o1 connection is made through "Evaluation of openai o1: Opportunities and challenges of AGI" [63] - a paper-mill product with 50 or so authors. They produced that 280-page monstrosity less than two weeks after the o1 release. Did I miss something? AFAIK, there's no published literature from OpenAI on o1, and nobody knows exactly what o1 is doing, but it seems the Chinese have figured it out in a matter of days... They say their model performs well on visual benchmarks, but I suspect that owes more to overfitting on those benchmarks in the first place.

Consider their Proposed Method:

"Each stage is initiated at the model’s discretion, without external prompt engineering frameworks or additional prompting. Specifically, we provide the model with four pairs of special tags: <SUMMARY></SUMMARY>, <CAPTION></CAPTION>, <REASONING></REASONING>, and <CONCLUSION></CONCLUSION>.

These tags correspond to summarizing the response approach, describing relevant image content, conducting reasoning, and preparing a final answer, respectively. Upon training, the model autonomously selects these tags as needed, activating each stage based on its own judgment.

As with OpenAI o1 [63], all stages are completed by the model in a single inference pass."

[63]: https://arxiv.org/pdf/2409.18486
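
To make the single-pass, tag-structured output concrete, here's a toy sketch of how you'd pull the four stages back out. The example response text is invented by me, not an actual model output from the paper:

    import re

    # hand-written example of what a single-pass, tag-structured response could look like
    response = (
        "<SUMMARY>I will count each object type, then compare the counts.</SUMMARY>"
        "<CAPTION>The image shows three red cubes and two blue spheres.</CAPTION>"
        "<REASONING>Three cubes versus two spheres, so cubes outnumber spheres.</REASONING>"
        "<CONCLUSION>There are more cubes than spheres.</CONCLUSION>"
    )

    def extract_stage(text, tag):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return match.group(1).strip() if match else None

    for tag in ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION"):
        print(f"{tag}: {extract_stage(response, tag)}")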

tucnak | 4 days ago

Figure 2 in the paper shows what I really dislike about a lot of vision model benchmarks.

I care about whether these VLMs can accurately _see_ and _describe_ things in a picture. Meanwhile, the vision part of these benchmarks is mostly extremely basic OCR that any VLM from the past year can handle. The gains in score come from the LM's logic skills improving, not from the actual vision ability improving.

Jackson__ | 4 days ago

The graph on the first page has a very interesting choice of x-axis.

Wilsoniumite | 4 days ago

Has anyone found a use for LLAVA yet?

LLAMA can be trusted to summarize and format information, and some of the other models can be OK coding assistants, but when I was showing Ollama off to a friend I struggled to think of anything useful beyond the party trick of "yup, that's what is in the picture".

Obviously it would be useful to blind people, but the hard part is finding a use where the person couldn't just look at the picture themselves. It could possibly be used on a security camera combined with a basic keyword alert, but I imagine there would be a lot of false positives and false negatives.
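
As a rough sketch of that security-camera idea, assuming a local Ollama server with a llava model already pulled; the /api/generate endpoint and field names are Ollama's as I understand them, so double-check against the docs:

    import base64
    import json
    import urllib.request

    ALERT_WORDS = {"person", "people", "vehicle", "package"}

    def describe_frame(path):
        # send one camera frame to a local Ollama llava model and return its description
        with open(path, "rb") as f:
            img_b64 = base64.b64encode(f.read()).decode()
        payload = json.dumps({
            "model": "llava",
            "prompt": "Briefly describe what is happening in this image.",
            "images": [img_b64],
            "stream": False,
        }).encode()
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    def check_frame(path):
        # naive keyword alert on top of the description
        description = describe_frame(path)
        if any(word in description.lower() for word in ALERT_WORDS):
            print("ALERT:", description)
        else:
            print("ok:", description)

    # check_frame("frame_0001.jpg")  # hypothetical frame grabbed from the camera feed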

Larrikin | 3 days ago

This paper doesn't compare against MOLMO or Qwen, so I would take it with a grain of salt.

snats | 4 days ago

What are options to fine tune?

For instance, if I have a CAD model of a screw fastened to a wall, can I teach it that it's a screw fastened to a wall?

I have years worth of potential training data.

Consider this a multi-million dollar problem.
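
If you can render the CAD models to images, LLaVA-style fine-tuning pipelines typically take conversation-formatted JSON records roughly like the sketch below. The field names follow the common community format; the file names and captions are invented, so adapt them to your data:

    import json

    # only the record shape matters here; image paths and answers are placeholders
    records = [
        {
            "id": "cad-000001",
            "image": "renders/wall_bracket_view_03.png",  # a rendered view of the CAD model
            "conversations": [
                {"from": "human", "value": "<image>\nWhat does this image show?"},
                {"from": "gpt", "value": "A screw fastened to a wall."},
            ],
        },
    ]

    with open("train.json", "w") as f:
        json.dump(records, f, indent=2)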

resource_waste | 4 days ago

Generating data with an OpenAI model AND copying the approach from an OpenAI model. This is a bit unsatisfactory; it's like saying you wrote some working code when in fact you decompiled the binary and then compiled it again.

startupsfail | 4 days ago
