What are the trade-offs you've made to achieve this?
1. Why do you offer only a limited number of models publicly? Do you have to configure each one manually?
2. I don't see the 50% cheaper option. According to your pricing page, 16B+ models will cost $0.90, which is the same price as Together.ai and fireworks.ai.
Unrelated: During the dot-com boom, there was a company called nCompass Labs that developed one of the first content management systems (https://en.wikipedia.org/wiki/NCompass_Labs_Inc). Microsoft bought them in 2001. Their product was "a plug-in for hosting ActiveX controls in Netscape Navigator named ScriptActive." ActiveX itself was a novelty, using C++ templates to define reusable and _downloadable_ web components.
All of this crap was happily replaced with JavaScript frameworks in later years. Yes, back in the early 2000s, your browser might literally download executable code just to render a custom button.
Interesting approach to model serving - the 2-4x lower TTFT compared to vLLM is impressive, but I'd be curious to see detailed benchmarks across different batch sizes and model architectures to validate those performance claims. The no-rate-limits policy is bold, but it could get expensive fast if you're not doing some clever GPU utilization under the hood.
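For what it's worth, TTFT is easy enough to spot-check yourself against any OpenAI-compatible streaming endpoint. Here's a minimal sketch - the base URL, model id, and key below are placeholders, not nCompass's actual API, so swap in whichever provider you're probing:

```python
# Minimal TTFT probe against an OpenAI-compatible streaming endpoint.
import json
import time

import requests

BASE_URL = "https://api.example-provider.com/v1"  # placeholder endpoint
API_KEY = "sk-..."                                # your key here
MODEL = "example-model"                           # placeholder model id

def measure_ttft(prompt: str) -> float:
    """Seconds from request start to the first streamed content token."""
    start = time.perf_counter()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
        stream=True,
        timeout=60,
    )
    resp.raise_for_status()
    # Server-sent events: each chunk arrives as a "data: {...}" line.
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):].strip()
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):  # first actual token arrived
            return time.perf_counter() - start
    raise RuntimeError("stream ended before any content token")

# Sweep a few prompt sizes; repeat runs and take medians for a real benchmark.
for words in (10, 100, 1000):
    print(f"{words:>5}-word prompt: TTFT = {measure_ttft('word ' * words):.3f}s")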
One vote for image inputs here. I would love a fine-tuned qwen-2-vl-72b on demand, but most of the solutions are "talk to us" level expensive. I'm assuming you beat the price or convenience of a replicate / modal solution?
Since you're calling out your support for underserved models, can I request support for some SOTA embedding models? Embedding support from other providers is poor: only a handful of outdated models, and high latency.
https://console.ncompass.tech/models has no models on it, just a "Get in Touch" button.
this sounds like black magic, kudos to you. i'd love to chat, dm me on https://twitter.com/swyx if you'd find it useful to chat with someone like me.
What is Groq (rate-limited) missing that you aren't?
Random idea -- I think it would be cool for hosts that advertise efficiency to have a dashboard showing total tokens per watt-hour (or whatever usage:energy metric) graphed over time for each model they host, taking into account as much of their infra as possible (rough sketch of the metric after the list below).
This would:
- let you boast about your cool proprietary optimizations
- naturally get better over time just from applying public algorithmic improvements
- show up hosts that refuse to do the same
- give you a good incentive to keep on top of your own efficiency and competitiveness over time
- be a good response to users who vaguely know that AI takes "a lot" of energy -- it's actually gotten a lot better, but how much better?
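As a crude starting point, assuming you can already count generated tokens from your serving stack's metrics, you could sample GPU board power via NVML over a window and divide. GPU power alone undercounts, of course; a fair dashboard would fold in CPU, networking, and cooling (PUE) to actually cover "as much of their infra as possible". The token count below is a placeholder:

```python
# Back-of-envelope tokens-per-watt-hour: integrate GPU board power over a
# window while the serving process handles traffic, then divide tokens by it.
import time

import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def sample_energy_wh(duration_s: float, interval_s: float = 0.5) -> float:
    """Integrate instantaneous power draw into watt-hours over a window."""
    joules = 0.0
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        watts = pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000.0  # mW -> W
        joules += watts * interval_s
        time.sleep(interval_s)
    return joules / 3600.0  # 1 Wh = 3600 J

energy_wh = sample_energy_wh(60.0)
# tokens_served would come from your own serving counters (e.g. a Prometheus
# metric of generated tokens) for the same 60 s window -- placeholder here.
tokens_served = 1_500_000
print(f"{tokens_served / energy_wh:,.0f} tokens/Wh")
```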
Happy to chat if it would help to have a neutral academic voice involved.