AI World Clocks

waxpancake | 1266 points

hi, I made this. thank you for posting.

I love clocks and I love finding the edges of what any given technology is capable of.

I've watched this for many hours. Kimi frequently gets the most accurate clock, but it also has the least variation and is the most boring. Qwen is oftentimes the most insane and makes me laugh. Which one is "better"?

lanewinfield | a day ago

Watching this over the past few minutes, it looks like Kimi K2 generates the best clock face most consistently. I'd never heard of that model before today!

Qwen 2.5's clocks, on the other hand, look like they never make it out of the womb.

otterley | a day ago

Since the first (good) image generation models became available, I've been trying to get them to generate an image of a clock with 13 instead of the usual 12 hour divisions. I have not been successful. Usually they will just replace the "12" with a "13" and/or mess up the clock face in some other way.

I'd be interested if anyone else is successful. Share how you did it!

baltimore | a day ago

I've been struggling all week trying to get Claude Code to write code that produces visual output (not the usual, verifiable text on a terminal) in the form of an SDL_GPU-rendered scene consisting of the usual things like shaders, pipelines, buffers, textures and samplers, vertex and index data, and so on, and boy, it just doesn't seem to know what it's doing. Despite paragraphs-long, detailed prompts. Despite describing each uniform and each matrix that needs to be sent. Despite extremely detailed guidance about the order things need to be done in. It would have been faster for me to just write the code myself.

When it fails a couple of times it will put logging in place and then confidently tell me things like "The vertex data has been sent to the renderer, therefore the output is correct!" When I suggest it take a screenshot of the output each time to verify correctness, it does, and then declares victory over an entirely incorrect screenshot. When I suggest it write unit tests, it does, but the tests are worthless and only verify that the incorrect code it wrote is always incorrect in the same ways.

When it fails even more times, it will get into what I like to call "intern engineer" mode, where it just tries random things that I know are not going to work. And if I let it keep going, it will end up modifying the entire source tree with random "try this" crap. And each iteration, it confidently tells me: "Perfect! I have found the root cause! It is [garbage bullshit]. I have corrected it and the code is now completely working!"

These tools are cute, but they really have a long way to go before they are actually useful for anything more than trivial toy projects.

ryandrake | a day ago

Amazing. Some people who use LLMs for soft outcomes are so enamored with them, and they disagree with me when I say to be careful because they're not perfect. This is such a great non-technical way to explain the reality I'm seeing when using them on hard-outcome coding/logic tasks: "Hey, this test is failing." LLM deletes test. "FIXED!"

munro | a day ago

Non-determinism at its finest. The clock is perfect, the refresh happens, the clock looks like a Dali painting.

kylecazar | a day ago

I'm having a hard time believing this site is honest, especially with how ridiculous the scaling and rotation of the numbers is for most of them. I dumped his prompt into ChatGPT to try it myself, and it created a very neat clock face with the numbers in the correct positions and an animated second hand; it just got the exact time wrong, off by a few hours.

Edit: the time may actually have been perfect now that I account for my ISP's geo-located time zone

anon_cow1111 | 21 hours ago

LLMs can't "look" at the rendered HTML output to see if what they generated makes sense or not. But there ought to be a way to do that, right? To let the model iterate until what it generates looks right.

Currently, at work, I'm using Cursor for something that has an OpenGL visualization program. It's incredibly frustrating trying to describe bugs to the AI because it is completely blind. Like I just wanna tell it "there's no line connecting these two points but there ought to be one!" or "your polygon is obviously malformed as it is missing a bunch of points and intersects itself" but it's impossible. I end up having to make the AI add debug prints to, say, print out the position of each vertex, in order to convince it that it has a bug. Very high friction and annoying!!!

porphyra | a day ago

Claude Sonnet 4.5 with a little thinking: https://imgur.com/a/zcJOnKy

no thinking: better clock but not current time (the prompt is confusing here though): https://imgur.com/a/kRK3Q18

anotheryou | 10 hours ago

What's going on with Kimi K2 being so reasonable/unique in so many of these benchmarks I've seen recently? I'll have to try it out further for other stuff. Is it any good at programming?

RugnirViking | an hour ago

Why are DeepSeek and Kimi beating the other models by such a margin? Does it have to do with their specialization for this task?

zkmon | a day ago

Something I'm not able to wrap my head around is that Kimi K2 is the only model that produces a ticking second hand on every attempt, while the rest of them always move continuously. What fundamental differences in model training or implementation can result in this disparity? Or was this use case programmed into K2 after the fact?

paxys | a day ago

Always interesting/uncanny when AI is tested with human cognitive tests https://www.psychdb.com/cognitive-testing/clock-drawing-test.

mandolingual | a day ago

Most look like they were done by a beginner programmer on crack, but every once in a while a correct one appears.

em3rgent0rdr | a day ago

Cool, and marginally informative on the current state of things, but kind of a waste of energy given everything is re-done every minute to compare. We'd probably only need a handful of each to see the meaningful differences.

ugh123 | a day ago

Lack of Claude is a glaring oversight given how popular it is as an agentic coding model...

edfletcher_t137 | 20 hours ago

This is such a great idea! Surprisingly, Kimi K2 is the only one without any obvious problems. And it isn't even the full K2 thinking version? This made me reread this article from a few days ago:

https://entropytown.com/articles/2025-11-07-kimi-k2-thinking...

chaosprint | 20 hours ago

Reminds me of the Alzheimer's "draw a clock" test.

Makes me think that LLMs are like people with dementia! Perhaps it's the best way to relate to an LLM?

gwbas1c | a day ago
[deleted]
| a day ago

To be fair, this is a deceptively hard task.

S0y | a day ago

The more I look at it, the more I realise the reason for the cognitive overload I feel when using LLMs for coding. The same prompt to the same model for a pretty straightforward task produces wildly different outputs. Now imagine how wildly different the code outputs are when trying to generate two different logical functions. The casing is different, the commenting is different, there's no semantic continuity. Maybe if I give detailed prompts and ask it to follow them, it might, but in my experience prompt adherence is not so great either. I am at the stage where I just use LLMs as autocorrect, rather than for any generation.

wanderingmind | 16 hours ago

I like Deepseek v3.1's idea of radially-aligning each hour number's y-axis ("1" is rotated 30° from vertical, "2" at 60°, etc.). It would be even better if the numbers were rotated anticlockwise.

I'm not sure what Qwen 2.5 is doing, but I've seen similar in contemporary art galleries.

cornonthecobra | a day ago

Pretty cool already!

I use 'Sonnet 4.5 thinking' and 'Composer 1' (Cursor) the most, so it would be interesting to see how such SOTA models perform in this task.

arendtio | 9 hours ago
[deleted]
| a day ago

Qwen doesn't care about clocks, it goes the Dali way, without melting.

It even made a Nietzsche clock (I saw one <body> </body> which was surprisingly empty).

It definitely wins the creative award.

Bengalilol | 21 hours ago

https://gemini.google.com/share/00967146a995 works perfectly fine with Gemini 2.5 Pro

earth2mars | a day ago

It's really beautiful! Super clean UI.

The thing I always want from timezone tools is: “Let me simulate a date after one side has shifted but the other hasn’t.”

Humans do badly with DST offset transitions; computers do great with them.

Vera_Wilde | 13 hours ago

That's super neat. I'll keep checking back to this site as new models are released. It's an interesting benchmark.

boxedemp | 9 hours ago

This is cool, interesting to see how consistent some models are (both in success and failure)

I tried gpt-oss-20b (my go-to local) and it looks ok though not very accurate. It decided to omit numbers. It also took 4500 tokens while thinking.

I'd be interested in seeing it with some more token leeway as well as comparing two or more similar prompts. like using "current time" instead of "${time}" and being more prescriptive about including numbers

ticulatedspline | a day ago

In any case, those clocks are all extremely inaccurate, even if AI could build a decent UI (which is not the case).

Some months ago I published this site for fun: https://timeutc.com There's a lot of code involved to make it precise to the ms, including adjusting based on network delay, frame refresh rate instead of using setTimeout and much more. If you are curious take a look at the source code.

collimarco | a day ago

Sonnet 4.5 does it flawlessly. Tried 8 times.

anonzzzies | 19 hours ago

deepseek representing

adriatp | 2 hours ago

Maybe they can ask Sora to make variations of:

https://slate.com/human-interest/2016/07/martin-baas-giant-r...

amelius | a day ago

See https://clock.rt.ht/::code

AI-optimized <analog-clock>!

People expect perfection on first attempt. This took a brief joint session:

HI: define the custom element API design (attribute/property behavior) and the CSS parts

AI: draw the rest of the f… owl

rtcode_io | a day ago

I wonder which model will silently be updated and suddenly start drawing clocks with Audemars-Piguet-level kind of complications.

3oil3 | 14 hours ago

I'm very curious about the monthly bill for such a creative project; surely some of these are pre-rendered?

syx | a day ago

Where's Opus/Sonnet? Very curious about that!

nasir | a day ago

You should render it, show an image to the model and allow it to iterate. No person has to one-shot code without seeing what it looks like.

bwhiting2356 | 16 hours ago

Kimi K2 is obviously the best, but GPT-5 has the most gorgeous ones when it works.

whimsicalism | a day ago

Interesting idea!

Why is a new clock being rendered every minute? Or are AI models evolving and improving every minute?

shahzaibmushtaq | 14 hours ago

It is funny to see the performance improve across many of the models, somewhat miraculously, throughout the day today.

wewtyflakes | 15 hours ago

What does it mean that each model is allowed 2000 tokens to generate its clock?

orly01 | a day ago

I just realized I'm running late, it's almost -2!

More seriously, I'd love to see how the models perform the same task with a larger token allowance.

bigbluedots | 19 hours ago

Add some voting and you got yourself an AI World Clock arena! https://artificialanalysis.ai/image/arena

kfarr | a day ago

Very funny. It seems Qwen generates the funniest outputs :)

hansmayer | a day ago

Watching these gives me a strong feeling of unease. Art-wise, it is a very beautiful project.

josfredo | 15 hours ago

just curious, why not the sonnet models? In my personal experience, Anthropic's Sonnet models are the best when it comes to things like this!

aavshr | a day ago

Try adding to the prompt that it has a PhD in Computer Science and has many methods for dealing with complexity.

This gives better results, at least for me.

xyproto | a day ago
[deleted]
| a day ago

Weird. Sonnet 4.5 one shotted it with:

Create an interactive artifact of an analog clock face that keeps time properly.

https://claude.ai/public/artifacts/75daae76-3621-4c47-a684-d...

bongodongobob | a day ago

The selection of Western models is weird: no GPT-5.1 or Opus 4.1 (which nailed it perfectly in something I quickly tested).

maxdo | 21 hours ago
[deleted]
| a day ago
[deleted]
| a day ago

How Kimi is better than other BILLION$ companies is really fun

Zeraous | 8 hours ago

If a human had done this, these would be at a museum

stym06 | 15 hours ago

Was Claude banned from this Olympics?

zkmon | a day ago

This is great. If you think that the phenomena of human-like text generation evinces human-like intelligence, then this should be taken to evince that the systems likely have dementia. https://en.wikipedia.org/wiki/Montreal_Cognitive_Assessment

abathologist | a day ago

This is why we need terawatt DCs: to generate code for world clocks every minute.

__fst__ | a day ago

Looks like we've got a new Turing test here: "draw me a clock"

HarHarVeryFunny | 21 hours ago

This is an AD for Kimi K2

esotericwarfare | 19 hours ago

Is there a "draw a pelican riding a bicycle" version?

bigbluedots | 19 hours ago

GPT-5 looks broken

baidoct | 8 hours ago

How do they do time without JavaScript? Is there an API I’m not aware of?

Waterluvian | a day ago

Qwen's clocks are hilarious

Imanari | a day ago

I love that GPT-5 is putting the clock hands way outside the frame and is just generally a mess. Maybe we'll look back on these mistakes like watching kids grow up and fumble basic tasks. Humorous in its own unique way.

accrual | 21 hours ago

Because a new clock is generated every minute, it looks like simply changing the time by a digit causes the result to be significantly different from the previous iteration.

busymom0 | a day ago

Seems like Will's clock drawing test in Hannibal :)

0xCE0 | a day ago

This really needs to be an xscreensaver hack.

ssl-3 | a day ago

The qwen clocks are art.

woopwoop | 15 hours ago

It's cool to see them get it right... sometimes

AlfredBarnes | a day ago

Grok is impressive, I should give it a shot

jcmontx | a day ago

Anyone tried opening this on mobile? Not a single clock renders correctly; it almost looks like a joke on LLMs.

gloosx | a day ago

The new Turing time test

miohtama | 19 hours ago
[deleted]
| a day ago

GPT-5 is embarrassing itself. Kimi and DeepSeek are very consistently good. Wild that you can just download these models.

mstipetic | a day ago

Obviously they're all broken on Firefox; no one uses Firefox anyway.

hollow-moe | a day ago

I believe that in a day or two the companies will address this, and it will be solved for that use case.

JamesAdir | 12 hours ago

grok's looks like one of those clocks you'd find at a novelty shop

bananatron | a day ago

Not sure about the accuracy, though; I'm shooting in the dark.

shubham_zingle | a day ago

Honestly, I think if you track the performance of each model over time, since these get regenerated periodically, you could have a very, very useful and cohesive benchmark.

lxe | a day ago

Would be great to also see the prompt this was done with.

larodi | a day ago

Lol. This is supposed to replace me at my job already?

Great experiment!

warpspin | 7 hours ago

I wonder if Qwen's output would look like a hallucination?

1yvino | a day ago

666

cyberjill | 17 hours ago

I love qwen, it tries so hard with its little paddle and never gets anywhere.

imchillyb | 18 hours ago

I'm reminded of the "draw a clock" test neurologists use to screen for dementia and brain damage.

bitwize | 21 hours ago

Qwen 2.5 doing a surprisingly good job (as of right now).

teaearlgraycold | a day ago

How can Deepseek and Kimi get it right while Haiku, Gemini and GPT are making a mess?

DeathArrow | a day ago

Security-wise, this is a website that takes the raw output of an AI and serves it for execution on their website.

I know developers do the same, but at least they check it into Git and can notice their mistakes. This is an opportunity for the AI to spring a Google authentication prompt on you, or anything else.

eastbound | a day ago

It's wild how much the output varies for the same model for each run.

I'm not sure if this was the intent or not, but it sure highlights how unreliable LLMs are.

bpt3 | a day ago

Oh cool, it's the schizophrenia clock-drawing test but for AI.

novemp | a day ago

Ask Claude or ChatGPT to write it in Python, and you will see what they are capable of. HTML + CSS has never been the strong suit of any of these models.

system2 | a day ago

Now that is actually creative.

Granted, it is not a clock - but it could be art. It looks like a Picasso. When he was drunk. And took some LSD.

shevy-java | a day ago

kimi is kicking ass

jonplackett | a day ago

What a waste of energy.

kwanbix | a day ago

Whatever model Cursor uses was telling me the date was March 12, 2023.

fnord77 | 18 hours ago

What a wonderfully visual example of the crap LLMs turn everything into. I am eagerly awaiting the collapse of the LLM bubble. JetBrains added this crap to their otherwise fine series of IDEs and now I have to keep removing randomly inserted import statements and keep fixing hallucinated names of functions suggested instead of the names of functions that I have already defined in the same file. Lack of determinism where we expect it (most of the things we do, tbh) is creating more problems than it is solving.

surfingdino | 11 hours ago

lol

jsmo | 14 hours ago
[deleted]
| a month ago

[dead]

Gormanu | a day ago

[dead]

superlukas99 | 17 hours ago

Why? This is diagonal to how LLMs work, and trivially solved by a minimal hybrid front/sub system.

PeterStuer | a day ago

These types of tests are fundamentally flawed. I was able to create a perfect clock using Gemini 2.5 Pro - https://gemini.google.com/share/136f07a0fa78

kburman | a day ago

Limiting the model to only 2000 tokens while also asking it to output ONLY HTML/CSS is just stupid. It's like asking a programmer to perform the same task with half their brain removed, after forgetting their programming experience. This is a stupid and meaningless benchmark.

awkwam | a day ago