Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals

suchintan | 30 points

Many of the examples given for agents like this are things I flat-out wouldn’t trust an LLM to do. Buying something on Amazon, for example: will it pick new or ‘renewed’? Will it select an item from a janky-looking vendor that may be counterfeit? Will it pick the cheapest option for me? What if multiple colors are offered?

This one example alone has so many branches that would require knowing what’s in my head.

On the flip side, it’s a ridiculously simple task for a human to do for themselves, so what am I truly saving?

Call me when I can ask it to check the professional reviews of X category on N websites (plus YouTube), summarize them for me, and find the cheapest source for the top 2 options in the category that will arrive in Y days or sooner.

That would be useful.

happyopossum | 6 hours ago

This is an impressive tool. I especially like the observability around the workflow and the steps it takes to achieve the outcome. We are potentially interested in exploring this if we can get the cost down at scale.

lyime | 5 hours ago

Isn't Browser Use SOTA on WebVoyager? At this point WebVoyager seems outdated; there's def a need for a new, harder benchmark.

skull8888888 | 3 hours ago

Congrats, Suchintan! Huge achievement!

govindsb | 6 hours ago