If you are interested in that space, I'd throw our project into the mix; it uses ColPali under the hood transparently.
https://github.com/tjmlabs/ColiVara
The main benchmark for this is the ViDoRe leaderboard, where we would love to see how VoyageAI performs compared to the more open-source implementations.
I'm missing something. Shouldn't any LLM that's 'natively multimodal' somehow include embeddings which are multimodal? For example, here's Google's blog post on Gemini:
Until now, the standard approach to creating multimodal models involved
training separate components for different modalities and then stitching them
together to roughly mimic some of this functionality. These models can
sometimes be good at performing certain tasks, like describing images, but
struggle with more conceptual and complex reasoning.
We designed Gemini to be natively multimodal, pre-trained from the start on
different modalities. Then we fine-tuned it with additional multimodal data to
further refine its effectiveness. This helps Gemini seamlessly understand and
reason about all kinds of inputs from the ground up, far better than existing
multimodal models — and its capabilities are state of the art in nearly every
domain.
Indeed, sad that their models are both proprietary and API-only.
This does read as very impressive. Any critical perspectives on the presented evaluation? What about non-English text?
I understand the model is, like other commercial ones, available exclusively through their API, right?
API-only model. No thanks but congrats anyway.
Looks quite interesting! I’ve been working on AnyModal, a framework for integrating different data types (like images and audio) with LLMs: https://github.com/ritabratamaiti/AnyModal. It seems that voyage-multimodal-3 would be quite promising in developing multimodal LLMs, but I am not sure if that is the intended use case.
In the traditional Python API, the Voyage engine will tokenize blocks of text and output a string of characters. This model seems to be doing that by vectorizing images in space.
Words like 'you' and 'apple' will be a unitary token. More complex terms like 'pikachu' may be divided into pik-a-chu.
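To make the pik-a-chu idea concrete, here is a minimal sketch of greedy longest-match subword splitting. The vocabulary is hypothetical and chosen only to reproduce the example above; real tokenizers (BPE, WordPiece and similar) learn their vocabularies from data, and nothing here is assumed about what Voyage actually uses.

```python
# Toy greedy longest-match subword tokenizer.
# VOCAB is a made-up vocabulary for illustration, not any real model's.
VOCAB = {"you", "apple", "pik", "a", "chu"}

def tokenize(word, vocab=VOCAB):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches: emit a single character.
            pieces.append(word[i])
            i += 1
    return pieces

for w in ("you", "apple", "pikachu"):
    print(w, tokenize(w))
```

Common words that sit in the vocabulary come out as one piece, while a rarer word is assembled from smaller pieces.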
This is a cool way to look at multimodal embeddings. They look at performance as the percentage of inputs slides from one modality to another:
https://i0.wp.com/blog.voyageai.com/wp-content/uploads/2024/...
The colab measures dot product values 0.428 and 0.498, describing them as "...similarity value is quite high." Is that high? Can you design a system that confidently labels data with a 0.4 threshold?
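One way to reason about whether 0.428 is "high": a raw dot product only means something relative to the scores the same model gives unrelated pairs. The sketch below assumes unit-normalized embeddings (so the dot product equals cosine similarity) and uses two made-up background score lists; the helper names are illustrative, not from the colab.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fraction_below(score, background_scores):
    """Share of background (unrelated-pair) scores strictly below `score`."""
    return sum(s < score for s in background_scores) / len(background_scores)

# Hypothetical scores for unrelated pairs under two different models.
SPREAD_OUT_MODEL = [0.10, 0.15, 0.22, 0.30, 0.35]
COMPRESSED_MODEL = [0.40, 0.45, 0.50, 0.55]

print(fraction_below(0.428, SPREAD_OUT_MODEL))
print(fraction_below(0.428, COMPRESSED_MODEL))
```

Under the first background 0.428 beats every unrelated pair; under the second it beats only a quarter of them. So a fixed 0.4 threshold is only defensible after calibrating against the score distribution of the specific model and data.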
Check out ColPali and ColQwen for a SOTA open source version.
I wish people would take the time to put in real datasets and make qualitative analysis of when and why "foo new solution" is better.
Quantitative benchmarks are great, but sparse.
Funny, all those big name Stanford advisors for a company that builds embeddings... A couple of strong MLEs can deliver everything they are doing. This shouldn't be a company but OK... I'm sure some clueless VCs in SV gave them money.
And just to be clear. I don't think that delivering strong embeddings for different domains is an easy task. However, it's 2024 not 2016.
This is a key observation that is simple and intuitive:
>All CLIP-like models perform poorly on mixed-modality search due to a phenomenon known as the modality gap. As illustrated in the figure below, the closest vector to the snippet “I address you, members of the Seventy-Seventh Congress…” is not its screenshot, but other texts. This leads to search results that are skewed towards items of the same modality; in other words, text vectors will be closer to irrelevant texts than relevant images in the embedding space.
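A toy illustration of the modality gap described in that quote, with hand-made 4-dimensional vectors (entirely hypothetical, not output from any real model). Two dimensions encode the topic and two encode the modality. When the modality component dominates, an unrelated text outranks the matching image.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Layout of each made-up vector: [topic A, topic B, "is text", "is image"].
query_text = [1.0, 0.0, 2.0, 0.0]           # text about topic A
candidates = {
    "relevant_image": [1.0, 0.0, 0.0, 2.0],   # screenshot about topic A
    "irrelevant_text": [0.0, 1.0, 2.0, 0.0],  # text about topic B
}

def rank(query, items):
    """Return candidate names sorted from most to least similar to `query`."""
    return sorted(items, key=lambda name: cosine(query, items[name]), reverse=True)

for name in rank(query_text, candidates):
    print(name, round(cosine(query_text, candidates[name]), 3))
```

The irrelevant text wins because it shares the large "is text" component with the query, which is the skew towards same-modality results that the quote describes.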