Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps

jeffreyip | 107 points

The DAG feature for subjective metrics sounds really promising. I've been struggling with the same "good email" problem. Most of the existing benchmarks are too rigid for nuanced evaluations like that. Looking forward to seeing how that part of DeepEval evolves.

codelion | 6 hours ago

This looks nice and flashy for an investor presentation, but practically I just need the thing to work off an API, or, if it's all local, to at least have vLLM support so a bench doesn't take 10 hours to run.

The extra-long documentation and abstractions are, for me personally, exactly what I DON'T want in a benchmarking repo: which transformers version is this, will it support TGI v3, will it automatically strip thinking traces with a flag in the code or the run command, will it run the latest models that need a custom transformers version, etc.

And if it's not a locally runnable product, it should at least have a publicly accessible leaderboard to submit OSS models to, or something.

Just my opinion. I don't like it. It looks like way too much documentation and code slop for what should just be a 3-line command.
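
For reference, this is about the level of ceremony I'd accept for pointing it at a local model. A sketch, not something I've run: it assumes DeepEval's custom-judge hook (DeepEvalBaseLLM) works roughly the way its docs describe, and the model name is just an example.

```python
from vllm import LLM, SamplingParams          # runs the model locally, no API server needed
from deepeval.models import DeepEvalBaseLLM   # assumed import path for the custom-model hook
from deepeval.metrics import AnswerRelevancyMetric


class LocalJudge(DeepEvalBaseLLM):
    """Use a locally loaded vLLM model as the LLM-as-a-judge backend."""

    def __init__(self, model_name: str = "Qwen/Qwen2.5-7B-Instruct"):  # example model
        self.model_name = model_name
        self._llm = None

    def load_model(self):
        if self._llm is None:
            self._llm = LLM(model=self.model_name)
        return self._llm

    def generate(self, prompt: str) -> str:
        out = self.load_model().generate([prompt], SamplingParams(max_tokens=512))
        return out[0].outputs[0].text

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return self.model_name


# Metrics take a custom judge via `model=`, so the rest should stay short:
metric = AnswerRelevancyMetric(model=LocalJudge())
```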

nisten | a day ago

>This brings us to our current limitations. Right now, DeepEval’s primary evaluation method is LLM-as-a-judge. We use techniques such as GEval and question-answer generation to improve reliability, but these methods can still be inconsistent. Even with high-quality datasets curated by domain experts, our evaluation metrics remain the biggest blocker to our goal.

Have you done any work on dynamic data generation?

I've found that even taking a public benchmark and remixing the order of the questions has a deep impact on model performance, ranging from catastrophic for tiny models to problematic for larger models once you get past their effective internal working memory.
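
Roughly the experiment I mean, as a sketch (the benchmark file, the model call, and the exact scoring are placeholders):

```python
import json
import random
import statistics


def call_model(context: str, question: str) -> str:
    # Placeholder: swap in a real call (API, local vLLM, etc.). Carrying the
    # running transcript in the prompt is what makes question order matter.
    raise NotImplementedError


def run_benchmark(questions: list[dict]) -> float:
    """Ask the questions in the given order, carrying the transcript forward."""
    transcript, correct = "", 0
    for q in questions:
        answer = call_model(transcript, q["question"])
        transcript += f"\nQ: {q['question']}\nA: {answer}"
        correct += int(answer.strip() == q["answer"])
    return correct / len(questions)


# Load a benchmark as a list of {"question": ..., "answer": ...} records.
with open("benchmark.json") as f:
    questions = json.load(f)

# Same questions, several random orderings: the spread is the interesting part.
scores = []
for seed in range(10):
    shuffled = questions[:]
    random.Random(seed).shuffle(shuffled)
    scores.append(run_benchmark(shuffled))

print(f"mean={statistics.mean(scores):.3f}  stdev={statistics.stdev(scores):.3f}")
```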

llm_trw | 19 hours ago

DAG sounds interesting. It might help me solve my biggest challenge with evals right now, which is testing subjective metrics, e.g. “is this a good email”.
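
For what it's worth, a sketch of how I'd try that with GEval today (the criteria wording and the example email are mine, not from the docs):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Subjective criteria expressed in plain language; GEval turns this into
# evaluation steps for the judge model.
good_email = GEval(
    name="Good Email",
    criteria="Is the actual output a clear, polite, and actionable email that addresses the input request?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

test_case = LLMTestCase(
    input="Ask the vendor for an updated invoice by Friday.",
    actual_output="Hi Sam, could you send over the updated invoice by Friday? Thanks!",
)

good_email.measure(test_case)
print(good_email.score, good_email.reason)
```

The appeal of DAG here is presumably being able to break "good email" into explicit, composable judgement steps rather than one fuzzy criteria string.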

stereobit | 10 hours ago

This looks great. I would love to know more about what makes Confident AI/DeepEval special compared to the tons of other LLM eval tools out there.

tracyhenry | a day ago

This is an awesome tool! Been using it since day 1 and will keep using it. Would recommend it to anyone looking for an LLM eval tool.

jchiu220 | 21 hours ago

Was also looking at Langfuse.ai and braintrust.dev.

Can anybody with experience give me a tip on the best way to evaluate, manage prompts, and trace calls?

TeeWEE | a day ago

Congrats guys! Back in the spring of last year I did an initial spike investigating tools that could evaluate the accuracy of our RAG responses at work. We used your services (tests and test dashboard) as a little demo.

fullstackchris | a day ago

this is sick, all-star founders making big moves ;)

avipeltz | 19 hours ago
