Show HN: Can your GPU run this LLM?

ALTRC0320 | 15 points

Hey HN, I made this simple GitHub tool to check how much VRAM you need to train or run inference on any LLM. It supports multiple inference frameworks (HuggingFace/vLLM/llama.cpp) and quantization formats (GGML/bnb). I made this after getting frustrated when I couldn't get a 4-bit 7B LLaMA to work on my RTX 4090 (24 GB), even though the model itself is only 7GB.
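For context, a checkpoint's file size isn't the whole VRAM story: the KV cache and framework overhead come on top of the weights. Here's a rough Python sketch of that kind of accounting; the layer/hidden-size numbers below are LLaMA-7B-ish defaults and the 1 GB overhead figure is a guess on my part, not the tool's actual formula.

```python
def estimate_inference_vram_gb(
    n_params_b: float,         # parameter count, in billions
    bits_per_param: int,       # 16 for fp16, 8 or 4 for quantized weights
    n_layers: int,
    hidden_size: int,
    context_len: int,
    batch_size: int = 1,
    overhead_gb: float = 1.0,  # CUDA context + framework buffers (rough guess)
) -> float:
    # Weights: params * bits / 8 bytes each
    weights_gb = n_params_b * 1e9 * bits_per_param / 8 / 1e9
    # KV cache: 2 tensors (K and V) per layer, hidden_size wide,
    # one entry per token, usually kept in fp16 (2 bytes)
    kv_cache_gb = 2 * n_layers * hidden_size * context_len * batch_size * 2 / 1e9
    return weights_gb + kv_cache_gb + overhead_gb

# 7B model at 4-bit with a 2048-token context: ~3.5 GB weights
# + ~1.1 GB KV cache + overhead, so noticeably more than the file size
print(f"{estimate_inference_vram_gb(7, 4, 32, 4096, 2048):.1f} GB")
```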

ALTRC0320 | 2 years ago

It's so over, I can't even run the lowest LLaMA model; it never began for iGPUcels https://ibb.co/jy6B0sf

I was actually looking for something that could test what LLM I could run, thanks a lot haha

IndigoIncognito | 2 years ago