What are your processes waiting on? In Linux top, show the WCHAN field; in FreeBSD top, look at the STATE field. Ideally, your service processes are waiting on I/O (epoll, select, kqread, etc.), or else you're CPU limited.
Is there any cross-room communication? Can you spawn a process per room? Scaling that stalls at 25% CPU on a 4 vCPU node strongly suggests a locked section limiting you to effectively single-threaded performance. Multiple processes serving rooms should bypass that if you can't find the culprit otherwise, but maybe there's something wrong in your load balancing, etc.
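A rough sketch of the multi-process route using Node's built-in cluster module; the port and handler here are placeholders, not your actual setup:

    import cluster from "node:cluster";
    import { cpus } from "node:os";
    import { createServer } from "node:http";

    if (cluster.isPrimary) {
      // One worker per vCPU; the primary hands incoming connections to workers.
      for (let i = 0; i < cpus().length; i++) cluster.fork();
    } else {
      // Each worker gets its own event loop, so a lock or hot path in one room
      // can no longer cap the whole box at roughly one core (25% of 4 vCPUs).
      createServer((req, res) => res.end(`handled by pid ${process.pid}\n`)).listen(3000);
    }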
Personally, I'd rather run with fewer layers, because then you don't have to debug the layers when you have perf issues. Do matchmaking wherever with whatever layers, and let your room servers run in the host OS, no containers. But nobody likes my ideas. :P
Edit to add: your network load is tiny. This is almost certainly something in your software, or how you've set up your layers. Unless those vCPUs are ancient, you should be able to push a whole lot more packets.
3000 pps / 6 Mbps is pretty much nothing for that server. I wouldn't change random network sysctl options.
> This suggests to me that the bottleneck isn’t CPU or app logic, but something deeper in the stack
Just a word of caution: I've seen plenty of people race towards, e.g., "it must be a bug in the kernel" when 98% of the time it's the app or some config.
Any monitoring/logging in your Node.js code?
I noticed, for example, that adding a New Relic agent drops HTTP throughput almost 10x.
Are you buffering your output? Doing one syscall (write) per client in a server for each keystroke is a significant amount of I/O overhead and context switching.
Try buffering the outgoing keystrokes to each client. Then someone typing "hello world" (11 characters) in a server of 50 people will use 50 syscalls instead of 550 (rough sketch below).
Think Nagle's algorithm.
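Something along these lines; this assumes a ws-style socket with a send() method, and the 25 ms tick and JSON framing are just placeholders:

    type Socket = { send(data: string): void };

    class BufferedClient {
      private queue: string[] = [];
      constructor(private sock: Socket) {}

      enqueue(msg: string) {
        this.queue.push(msg);
      }

      flush() {
        if (this.queue.length === 0) return;
        // One write per flush instead of one write per keystroke.
        this.sock.send(JSON.stringify(this.queue));
        this.queue = [];
      }
    }

    const clients: BufferedClient[] = [];
    // Flush everyone on a short tick; typing bursts get coalesced into one frame.
    setInterval(() => clients.forEach((c) => c.flush()), 25);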
Please also note that Hetzner doesn't expose CPU steal information inside its VMs, so there could be 75% steal and you wouldn't notice! It's unlikely on CCX instances, but it happened to me a lot with regular instances.
It sounds like you want to coalesce the outbound updates; otherwise, with everyone typing, the traffic is accidentally quadratic.
Are you awaiting anywhere such that you might be better off doing fire-and-forget instead?
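For example (the promise-returning send here is hypothetical, just to show the difference):

    type Client = { send(data: string): Promise<void> };

    // Awaiting each send serializes the broadcast: one slow client delays the rest.
    async function broadcastSerialized(clients: Client[], payload: string) {
      for (const c of clients) {
        await c.send(payload);
      }
    }

    // Fire-and-forget: start all sends, handle failures out of band.
    function broadcastFireAndForget(clients: Client[], payload: string) {
      for (const c of clients) {
        c.send(payload).catch(() => { /* drop or log the client */ });
      }
    }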
Are you using uWebSockets.js?
25% CPU usage could indicate that your I/O throughput is bottlenecked.
Node gives access to event loop utilization stats that may be of value.
See the docs for how it works and how to derive some value from it. We had a similar situation where our application was heavily I/O bound (very little CPU), which caused some initial confusion about the slowdown. We ended up adding better metrics around I/O and the event loop, which led to us batch-dequeuing our jobs in a more reasonable way and made the entire application much more effective.
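A minimal sketch of sampling it via perf_hooks (the interval and logging are placeholders):

    import { performance } from "node:perf_hooks";

    let last = performance.eventLoopUtilization();

    setInterval(() => {
      // Utilization over the last window: near 1.0 means the loop is busy;
      // low utilization alongside low CPU points at waiting on I/O.
      const delta = performance.eventLoopUtilization(last);
      console.log(`event loop utilization: ${(delta.utilization * 100).toFixed(1)}%`);
      last = performance.eventLoopUtilization();
    }, 5000);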
If you crack the nut on this issue, I'd love to see an update comment detailing what the issue and solution were!