I tested the small model with a few images from CLEVR. At first blush I'm afraid it didn't do very well at all: it got object counts totally wrong and struggled to identify shapes and colours.
Still, it seems to understand what's in the images in general (cones and spheres and cubes), and the fact that it runs on my MacBook at all is basically amazing.
Did they fix multiline editing yet? Any interactive input that wraps across 3+ lines seems to become off-by-one when editing (though it's fine if you only append?), and this will only become more common now that long filenames are being added. And triple quotes break editing entirely.
How does this address the security concern of filenames being detected and read when that isn't wanted?
Is Qwen2-VL supported too? It's a great vision model and works in ComfyUI. Llama 3.2's vision seems to be super censored...
I thought llama.cpp didn't support images yet; has that changed, or is Ollama using a different library for this?
Does anyone know if this will run on an iPhone 15 (6GB) or iPhone 16 (8GB)?
Can it run the quantized models?
How likely is it to run on a reasonably new Windows laptop?
This was a pretty heavy lift for us to get out, which is why it took a while. In addition to writing new image-processing routines, a vision encoder, and cross attention, we also ended up re-architecting the way models get run by the scheduler. We'll have a technical blog post soon about all the stuff that ended up changing.
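For anyone unfamiliar with the cross-attention part: the idea is that the text decoder's hidden states attend over the vision encoder's patch embeddings, so image information flows into the language model at certain layers. Below is a minimal PyTorch-style sketch of that idea; the class, names, and shapes are purely illustrative assumptions and not Ollama's actual implementation (which lives in its Go/GGML runtime).

```python
# Illustrative sketch of vision-to-text cross attention (NOT Ollama's code).
# Text hidden states act as queries; vision-encoder patch embeddings act as
# keys and values, so the decoder can "look at" the image while generating.
import torch
import torch.nn as nn

class VisionTextCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_hidden: torch.Tensor, image_embeds: torch.Tensor) -> torch.Tensor:
        # text_hidden:  (batch, text_len, d_model)    -> queries
        # image_embeds: (batch, num_patches, d_model) -> keys and values
        attended, _ = self.attn(query=text_hidden, key=image_embeds, value=image_embeds)
        # Residual connection plus normalization, as in a typical transformer block.
        return self.norm(text_hidden + attended)

# Toy usage: one image encoded into 256 patch embeddings, a 16-token prompt.
layer = VisionTextCrossAttention(d_model=512, n_heads=8)
text = torch.randn(1, 16, 512)
patches = torch.randn(1, 256, 512)
out = layer(text, patches)
print(out.shape)  # torch.Size([1, 16, 512])
```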