Yeah, for better or worse, the way the median startup interfaces with AI these days is through an LLM API, and that&#x27;s what all the workflows are built around, so that&#x27;s what we&#x27;re targeting. Though, depending on what you&#x27;re trying to do, I wouldn&#x27;t discount the use of starting with a pretrained model—there was that famous result from 2022 that showed that pretraining a model on _Wikipedia_ made training on Atari games more than twice as efficient [0]; these days, LLMs have huge amounts of priors about the real world that make them great starting points for a surprisingly diverse set of tasks (e.g. see the chemistry example in our video!)[0]: <a href="https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2201.12122" rel="nofollow">https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2201.12122</a>

Have you heard of <a href="https:&#x2F;&#x2F;puffer.ai" rel="nofollow">https:&#x2F;&#x2F;puffer.ai</a>? Might fit your use case

Was excited to see something about reinforcement learning as I&#x27;m working on training an agent to play a game, but apparently all reinforcement learning nowadays is for LLMs.

To add to this, you can currently manually parse tool calls in your environment&#x27;s step function, but we&#x27;ll be rolling out a UI that makes this easier soon.

Thanks! Our goal is to make rl &quot;just work&quot; with completely automated GPU provisioning&#x2F;algorithm selection&#x2F;SFT-warm up, but giving people the ability to switch away from the defaults if they want to.The way tools currently work in the beta is you add tools via MCP to the configuration, and they get passed in as additional context for the model; the model might then choose to use a tool during inference; the tool is then automatically called and the output is returned as a tool message. If you really want to you could parse the tool output as part of reward calculation, but I expect you&#x27;d usually base the reward just on the model&#x27;s completion. I could give more details if there&#x27;s a specific tool setup you&#x27;re envisioning!

This is really neat! Didn’t realize it could be this simple to run RL on models. Quick question: How would I specify the reward function for tool use? or is this something you automatically do for me when I specify the available tools and their uses?

Isn’t the latest trend in RL mostly about prompt optimization as opposed to full fine tuning

DSPy is great for prompt optimization but not so much for RL fine-tuning (their support is &quot;extremely EXPERIMENTAL&quot;). The nice thing about RL is that the exact prompts don&#x27;t matter so much. You don&#x27;t need to spell out every edge case, since the model will get an intuition for how to do its job well via the training process.

ART is also great, though since it&#x27;s built on top of Unsloth it&#x27;s geared towards single GPU QLoRA training. We use 8 H100s as a standard, so we can handle larger models and full-parameter fine-tunes.

Perhaps less about DSPy, and rather about this: <a href="https:&#x2F;&#x2F;github.com&#x2F;OpenPipe&#x2F;ART" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;OpenPipe&#x2F;ART</a>

Is there any credence to the view that these startups are basically dspy wrappers

Hey HN, we’re Andrew and Derik at RunRL (<a href="https:&#x2F;&#x2F;runrl.com&#x2F;">https:&#x2F;&#x2F;runrl.com&#x2F;</a>). We&#x27;ve built a platform to improve models and agents with reinforcement learning. If you can define a metric, we&#x27;ll make your model or agent better, without you having to think about managing GPU clusters.Here&#x27;s a demo video: <a href="https:&#x2F;&#x2F;youtu.be&#x2F;EtiBjs4jfCg" rel="nofollow">https:&#x2F;&#x2F;youtu.be&#x2F;EtiBjs4jfCg</a>I (Andrew) was doing a PhD in reinforcement learning on language models, and everyone kept...not using RL because it was too hard to get running. At some point I realized that someone&#x27;s got to sit down and actually write a good platform for running RL experiments.Once this happened, people started using it for antiviral design, formal verification, browser agents, and a bunch of other cool applications, so we decided to make a startup out of it.How it works:- Choose an open-weight base model (weights are necessary for RL updates; Qwen3-4B-Instruct-2507 is a good starting point)- Upload a set of initial prompts (&quot;Generate an antiviral targeting Sars-CoV-2 protease&quot;, &quot;Prove this theorem&quot;, &quot;What&#x27;s the average summer high in Windhoek?&quot;)- Define a reward function, using Python, an LLM-as-a-judge, or both- For complex settings, you can define an entire multi-turn environment- Watch the reward go up!For most well-defined problems, a small open model + RunRL outperforms frontier models. (For instance, we&#x27;ve seen Qwen-3B do better than Claude 4.1 Opus on antiviral design.) This is because LLM intelligence is notoriously &quot;spiky&quot;; often models are decent-but-not-great at common-sense knowledge, are randomly good at a few domains, but make mistakes on lots of other tasks. RunRL creates spikes precisely on the tasks where you need them.Pricing: $80&#x2F;node-hour. Most models up to 14B parameters fit on one node (0.6-1.2 TB of VRAM). We do full fine-tuning, at the cost of parameter-efficiency (with RL, people seem to care a lot about the last few percent gains in e.g. agent reliability).Next up: continuous learning; tool use. Tool use is currently in private beta, which you can join here: <a href="https:&#x2F;&#x2F;forms.gle&#x2F;D2mSmeQDVCDraPQg8" rel="nofollow">https:&#x2F;&#x2F;forms.gle&#x2F;D2mSmeQDVCDraPQg8</a>We&#x27;d love to hear any thoughts, questions, or positive or negative reinforcement!

Launch HN: RunRL (YC X25) – Reinforcement learning as a service