Haven't seen Mozilla's LocalScore [1] mentioned in the comments yet. It's made exactly for this purpose: finding out how well different models run on different hardware.
What is everyone using their local LLMs for primarily? Unless you have a beefy machine, you'll never approach the level of quality of proprietary models like Gemini or Claude, but I'm guessing these smaller models still have their use cases, just not sure what those are.
I concur with the LocalLLaMA subreddit recommendation. Not in terms of choosing "the best model", but for answering questions, finding guides, the latest news and gossip, the names of tools and various models, how they stack up against each other, etc.
There's no one "best" model, you just try a few and play with parameters and see which one fits your needs the best.
Since you're on HN, I'd recommend skipping Ollama and LM Studio. They can be slower to support the latest models, and you typically only get to choose from the ones they've tested with. And besides, what fun is it when you don't get to peek under the hood?
llamacpp can do a lot by itself, and it can run most recently released models (when changes are needed, they adjust literally within a few days). You can get models from huggingface, obviously. I prefer the GGUF format since it saves me some memory (you can use lower quantization; I find 6-bit quants mostly satisfactory).
I find that the size of the model's GGUF file will roughly tell me whether it'll fit in my VRAM. For example, a 24GB GGUF model will NOT fit in 16GB, whereas a 12GB one likely will. However, the more context you add, the more memory will be needed.
Keep in mind that models are trained with a certain context window. If a model has an 8K-token context (like most older models do) and you load it with a 32K context, it won't be much help.
You can run llamacpp on Linux, Windows, or macOS; you can grab the binaries or compile locally. It can split the model between VRAM and RAM (if the model doesn't fit in your 16GB). It even ships a simple React front-end (llama-server). The same binary provides a REST service whose protocol is similar to (but simpler than) OpenAI's and all the other "big" guys'.
Since it implements the OpenAI REST API, it also works with a lot of front-end tools if you want more functionality (e.g. oobabooga, aka text-generation-webui).
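For instance, once llama-server is running, any OpenAI-style client or a plain curl call works against it (a minimal sketch, assuming the default 127.0.0.1:8080 host/port):

```
# query a running llama-server through its OpenAI-compatible endpoint
# (adjust --host/--port on the server side if you changed the defaults)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Summarize what a GGUF file is in one sentence."}],
        "temperature": 0.7
      }'
```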
Koboldcpp is another backend you can try if you find llamacpp too raw (I believe it's still llamacpp under the hood).
The best place to look is HuggingFace
Qwen is pretty good and comes in a variety of sizes. Given your VRAM I’d suggest Qwen/Qwen3-14B-GGUF at Q4_K_M, run with llama-server or LM Studio (there may be alternatives to LM Studio, but these are generally nice UIs over llama-server). It’ll use around 7-8GB for the weights, leaving room for incidentals.
Llama 3.3 could work for you
Devstral is too big at full precision, but you could run a quantized version
Gemma is good, but it tends to refuse a lot. MedGemma is a nice thing to have around just in case
“Uncensored” Dolphin models from Eric Hartford and “abliterated” models are what you want if you don’t want them refusing requests. It’s mostly not necessary for routine use, but sometimes you ask ’em to write a joke and they won’t do it; or if you wanted to do some business involving defense contracting or security research, that kind of thing, it could be handy.
Generally the dtype is bf16 (two bytes per parameter), so you multiply the number of billions of parameters by two to get the unquantized model size in GB
Then, to get a model that fits on your rig, you generally want a quantized model. Typically I go for “Q4_K_M”, which is roughly 4 bits per param, so you divide the number of billions of params by two to estimate the VRAM needed for the weights.
I'm not sure of the overhead for activations, but it's a good idea to leave wiggle room and experiment with sizes well below 16GB
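To make that arithmetic concrete, here's a back-of-the-envelope sketch using the rule of thumb above (the 14B figure is just an example; real Q4_K_M files run a little larger than an ideal 4 bits/param):

```
# back-of-the-envelope weight sizing for a 14B model (rule of thumb only)
PARAMS_B=14   # billions of parameters
echo "bf16 weights : $(echo "$PARAMS_B * 2" | bc) GB"   # 2 bytes per param -> ~28 GB
echo "Q4 weights   : $(echo "$PARAMS_B / 2" | bc) GB"   # ~0.5 bytes per param -> ~7 GB, plus context overhead
```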
llama-server is a good way to run models: it has a GUI on the index route and an -hf flag to download models from Hugging Face
LM Studio is a good GUI; it installs llama-server for you and can help with managing models
Make sure you run some kind of server that loads the model once. You definitely don’t want to load many gigabytes of weights into VRAM for every question if you want fast, real-time answers
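For example (a sketch: the repo/quant tag is just the Qwen3-14B Q4_K_M suggestion from above, and the -hf repo:quant shorthand assumes a reasonably recent llama.cpp build):

```
# pull the GGUF straight from Hugging Face and keep it resident in VRAM between questions
# --n-gpu-layers 99 keeps all layers on the GPU; shrink --ctx-size if VRAM gets tight
llama-server -hf Qwen/Qwen3-14B-GGUF:Q4_K_M --n-gpu-layers 99 --ctx-size 8192
```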
I only have 8GB of VRAM to work with currently, but I'm running OpenWebUI as a frontend to Ollama, and I have a very easy time loading up multiple models and letting them duke it out, either at the same time or in a round robin.
You can even keep track of the quality of the answers over time to help guide your choice.
Qwen3 family (and the R1 qwen3-8b distill) is #1 in programming and reasoning.
However it's heavily censored on political topics because of its Chinese origin. For world knowledge, I'd recommend Gemma3.
This post will be outdated in a month. Check https://livebench.ai and https://aider.chat/docs/leaderboards/ for up to date benchmarks
At 16GB a Q4 quant of Mistral Small 3.1, or Qwen3-14B at FP8, will probably serve you best. You'd be cutting it a little close on context length due to the VRAM usage... If you want longer context, a Q4 quant of Qwen3-14B will be a bit dumber than FP8 but will leave you more breathing room. Mistral Small can take images as input, and Qwen3 will be a bit better at math/coding; YMMV otherwise.
Going below Q4 isn't worth it IMO. If you want significantly more context, probably drop down to a Q4 quant of Qwen3-8B rather than continuing to lobotomize the 14B.
Some folks have been recommending Qwen3-30B-A3, but I think 16GB of VRAM is probably not quite enough for that: at Q4 you'd be looking at 15GB for the weights alone. Qwen3-14B should be pretty similar in practice though despite being lower in param count, since it's a dense model rather than a sparse one: dense models are generally smarter-per-param than sparse models, but somewhat slower. Your 5060 should be plenty fast enough for the 14B as long as you keep everything on-GPU and stay away from CPU offloading.
Since you're on a Blackwell-generation Nvidia chip, using LLMs quantized to NVFP4 specifically will provide some speed improvements at some quality cost compared to FP8 (and will be faster than Q4 GGUF, although ~equally dumb). Ollama doesn't support NVFP4 yet, so you'd need to use vLLM (which isn't too hard, and will give better token throughput anyway). Finding pre-quantized models at NVFP4 will be more difficult since there's less-broad support, but you can use llmcompressor [1] to statically compress any FP16 LLM to NVFP4 locally — you'll probably need to use accelerate to offload params to CPU during the one-time compression process, which llmcompressor has documentation for.
I wouldn't reach for this particular power tool until you've decided on an LLM already, and just want faster perf, since it's a bit more involved than just using ollama and the initial quantization process will be slow due to CPU offload during compression (albeit it's only a one-time cost). But if you land on a Q4 model, it's not a bad choice once you have a favorite.
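Once you do have an NVFP4 checkpoint, serving it is a one-liner (a sketch; ./qwen3-14b-nvfp4 is a placeholder path for whatever llmcompressor produced):

```
# vLLM picks up the compression scheme from the checkpoint's config and
# exposes an OpenAI-compatible API on port 8000 by default
vllm serve ./qwen3-14b-nvfp4 --max-model-len 16384
```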
I'd suggest buying a better GPU, only because all the models you want need a 24GB card. Nvidia... more or less robbed you.
That said, try Unsloth's version of Qwen3 30B, running via llama.cpp (don't waste your time with any other inference engine), with the following arguments (documented in Unsloth's docs, but sometimes hard to find): `--threads (number of threads your CPU has) --ctx-size 16384 --n-gpu-layers 99 -ot ".ffn_.*_exps.=CPU" --seed 3407 --prio 3 --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20`, along with whatever other arguments you need.
Qwen3 30B: https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF (since you have 16GB, grab Q3_K_XL: it fits in VRAM and leaves about 3-4GB for the other apps on your desktop and the other allocations llama.cpp needs to make).
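Assembled into a single invocation it looks something like this (a sketch: the -hf repo:quant shorthand and the 16-thread count are assumptions for illustration; you can equally pass -m with a downloaded .gguf path):

```
# flags taken from the Unsloth recommendations above; adjust --threads to your CPU
llama-server -hf unsloth/Qwen3-30B-A3B-128K-GGUF:Q3_K_XL \
  --threads 16 --ctx-size 16384 --n-gpu-layers 99 \
  -ot ".ffn_.*_exps.=CPU" --seed 3407 --prio 3 \
  --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20
```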
Also, why 30B and not the full-fat 235B? You don't have 120-240GB of VRAM. The 14B and smaller ones are also not what you want: more parameters are better, and parameter precision matters vastly less (which is why Unsloth has its specially crafted <=2-bit versions that are 85%+ as good, yet ridiculously tiny compared to the originals).
Full Qwen3 writeup here: https://unsloth.ai/blog/qwen3
I'm afraid that 1) you are not going to get a definite answer, 2) an objective answer is very hard to give, and 3) you really need to try a few of the most recent models on your own and give them the tasks that seem most useful/meaningful to you. There is a drastic difference in output quality depending on the task type.
The question might sound very basic, but this is a question for the hardware folks: are there any AI-enabled embedded software tools that make the lives of embedded developers easier? Also, how many embedded developers would there be in large MNCs in automobile, medical devices, or consumer electronics? I am trying to judge the TAM for such a startup.
Generally speaking, how can you tell how much VRAM a model will take? It seems like a valuable bit of data that is missing from downloadable model (GGUF) files.
Ollama[0] has a collection of models that are either already small or quantized/distilled, and come with hyperparameters that are pretty reasonable, and they make it easy to try them out. I recommend you install it and just try a bunch because they all have different "personalities", different strengths and weaknesses. My personal go-tos are:
Qwen3 family from Alibaba seem to be the best reasoning models that fit on local hardware right now. Reasoning models on local hardware are annoying in contexts where you just want an immediate response, but vastly outperform non-reasoning models on things where you want the model to be less naive/foolish.
Gemma3 from google is really good at intuition-oriented stuff, but with an obnoxious HR Boy Scout personality where you basically have to add "please don't add any disclaimers" to the system prompt for it to function. Like, just tell me how long you think this sprain will take to heal, I already know you are not a medical professional, jfc.
Devstral from Mistral performs the best on my command line utility where I describe the command I want and it executes that for me (e.g. give me a 1-liner to list the dotfiles in this folder and all subfolders that were created in the last month).
Nemo from Mistral, I have heard (but not tested), is really good for routing-type jobs, where you need something that can make a simple multiple-choice decision competently with low latency, and it's easy to fine-tune if you want to get that sophisticated.
Basic conversations are essentially RP I suppose. You can look at KoboldCPP or SillyTavern reddit.
I was trying Patricide unslop mell and some of the Qwen ones recently. Up to a point more params is better than worrying about quantization. But eventually you'll hit a compute wall with high params.
KV cache quantization is awesome (I use q4 for a 32k context with a 1080ti!) and context shifting is also awesome for long conversations/stories/games. I was using ooba but found recently that KoboldCPP not only runs faster for the same model/settings but also Kobold's context shifting works much more consistently than Ooba's "streaming_llm" option, which almost always re-evaluates the prompt when hooked up to something like ST.
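If you're on plain llama.cpp rather than Kobold, I believe the equivalent knob is the cache-type flag (a sketch; model.gguf is a placeholder, and exact flag spellings may drift between builds, so check llama-server --help):

```
# quantize the K cache to q4_0 to fit a long context in less VRAM
# (the V cache can usually be quantized too, but that generally needs flash attention enabled)
llama-server -m model.gguf --ctx-size 32768 --cache-type-k q4_0
```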
Related question: what is everyone using to run a local LLM? I'm using Jan.ai and it's been okay. I also see OpenWebUI mentioned quite often.
Did someone have a chance to try local llama for the new AMD AI Max+ with 128gb of unified RAM?
Wow, a 5060Ti. 16gb + I'm guessing >=32gb ram. And here I am spinning Ye Olde RX 570 4gb + 32gb.
I'd like to know how many tokens per second you can get out of the larger models especially (using Ollama + Open WebUI on Docker Desktop, or LM Studio, whatever). I'm probably not upgrading GPU this year, but I'd appreciate an anecdotal benchmark.
- gemma3:12b
- phi4:latest (14b)
- qwen2.5:14b [I get ~3 t/s on all these small models, acceptably slow]
- qwen2.5:32b [this is about my machine's limit; verrry slow, ~1 t/s]
- qwen2.5:72b [beyond my machine's limit, but maybe not yours]
This is what I have: https://sabareesh.com/posts/llm-rig/ ("All You Need is 4x 4090 GPUs to Train Your Own Model")
I have an RTX 3070 with 8GB VRAM and for me Qwen3:30B-A3B is fast enough. It's not lightning fast, but more than adequate if you have a _little_ patience.
I've found that Qwen3 is generally really good at following instructions and you can also very easily turn on or off the reasoning by adding "/no_think" in the prompt to turn it off.
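For example, against an OpenAI-compatible server (the host/port are assumptions; /no_think is Qwen3's own soft switch):

```
# appending /no_think makes Qwen3 skip the thinking phase and give a quick, direct answer
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Give me a one-line summary of what a MoE model is. /no_think"}]}'
```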
The reason Qwen3:30B works so well is because it's a MoE. I have tested the 14B model and it's noticeably slower because it's a dense model.
People ask this question a lot and annoyingly the answer is: there are many definitions of “best”. Speed, capabilities (e.g. do you want it to be able to handle images or just text?), quality, etc.
It’s like asking what the best pair of shoes is.
Go on Ollama and look at the most popular models. You can decide for yourself what you value.
And start small, these things are GBs in size so you don’t want to wait an hour for a download only to find out a model runs at 1 token / second.
I think you'll find that on that card most models that are approaching the 16G memory size will be more than fast enough and sufficient for chat. You're in the happy position of needing steeper requirements rather than faster hardware! :D
Ollama is the easiest way to get started trying things out IMO: https://ollama.com/
Phi-4 is scared to talk about anything controversial, as if they're being watched.
I asked it a question about militias. It thought for a few pages about the answer and whether to tell me, then came back with "I cannot comply".
Nidum is the name of an uncensored Gemma variant; it does a good job most of the time.
I find Ollama + TypingMind (or similar interface) to work well for me. As for which models, I think this is changing from one month to the next (perhaps not quite that fast). We are in that kind of period. You'll need to make sure the model layers fit in VRAM.
Good question. I've had some success with Qwen2.5-Coder 14B; I used the quantised version: huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct-GGUF:latest It worked well on my MacBook Pro M1 32GB. It does get a bit hot on a laptop though.
I like the Mistral models. Not the smartest but I find them to have good conversation while being small, fast and efficient.
And the part I like the most is that there is almost no censorship, at least not in the models I tried. For me, having an uncensored model is one of the most compelling reasons for running an LLM locally. Jailbreaks are a PITA, and abliteration and other uncensoring fine-tunes tend to make models that have already been made dumb by censorship even dumber.
Gemma-3-12b-qat https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf
Qwen_Qwen3-14B-IQ4_XS.gguf https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF
Gemma3 is a good conversationalist but tends to hallucinate. Qwen3 is very smart but also very stubborn (not very steerable).
Pick up a used 3090 with more VRAM.
It holds its value, so you won't lose much, if anything, when you resell it.
But otherwise, as said, install Ollama and/or Llama.cpp and run the model using the --verbose flag.
This will print out the tokens-per-second result after each prompt is returned.
Then find the best model that gives you a token per second speed you are happy with.
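For example, with Ollama (the model tag is just an example):

```
# --verbose prints timing stats after each reply, including tokens per second
ollama run qwen3:14b --verbose
```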
And as also said, 'abliterated' models are less censored versions of normal ones.
I’ve had awesome results with Qwen3-30B-A3B compared to other local LMs I’ve tried. Still not crazy good but a lot better and very fast. I have 24GB of VRAM though
Agree with what others have said: you need to try a few out. But I'd put Qwen3-14B on your list of things to try out.
Does anyone know of any local models that are capable of the same tool use (specifically web searches) during reasoning that the foundation models are?
I realize they aren’t going to be as good… but the whole search during reasoning is pretty great to have.
Look for something in the 500m-3b parameters range. 3 might push it...
SmolVLM is pretty useful. https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct
hf.co/bartowski/deepseek-ai_DeepSeek-R1-0528-Qwen3-8B-GGUF:Q6_K is a decent performing model, if you're not looking for blinding speed. It definitely ticks all the boxes in terms of model quality. Try a smaller quant if you need more speed.
Try out some models with LM Studio: https://lmstudio.ai/ It has a UI so it's very easy to download the model and have a UI similar to the chatGPT app to query that model.
The largest Gemma 3 and Qwen 3 you can run. Offload to RAM as many layers as you need to.
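With llama.cpp that's just the layer count (a sketch; the model path and the 30-layer split are placeholders to tune against your VRAM):

```
# keep as many layers on the GPU as fit, spill the rest to system RAM
llama-server -m gemma-3-27b-it-Q4_K_M.gguf --n-gpu-layers 30 --ctx-size 8192
```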
It's a bit like asking what flavour of icecream is the best. Try a few and see.
For 16GB and speed you could try Qwen3-30B-A3B with some offload to system RAM, or use a dense model, probably a 14B quant.
There's a new one every day it seems. Follow https://x.com/reach_vb from huggingface.
I'm running llama3.2 out of the box on my 2013 Mac Pro, the low end quad core Xeon one, with 64GB of RAM.
It's slow-ish but still useful, getting 5-10 tokens per second.
VEGA64 (8GB) is pretty much obsolete for this AI stuff, right (compared to e.g. M2Pro (16GB))?
I'll give Qwen2.5 a try on the Apple Silicon, thanks.
I've had good luck with GPT4All (Nomic) and either reason v1 (Qwen 2.5 - Coder 7B) or Llama 3 8B Instruct.
My personal preference this month is the biggest Gemma3 you can fit on your hardware.
I use Gemma3:12b on a Mac M3 Pro, basically like Grammarly.
Captain Eris Violet 12B fits those requirements.
Pretty much all Q4 models on huggingface fit in consumer-grade cards.
What about for a 5090?
this is yummy stuff
If you want to run LLMs locally then the localllama community is your friend: https://old.reddit.com/r/LocalLLaMA/
In general there's no "best" LLM model, all of them will have some strengths and weaknesses. There are a bunch of good picks; for example:
> DeepSeek-R1-0528-Qwen3-8B - https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
Released today; probably the best reasoning model in 8B size.
> Qwen3 - https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...
Recently released. Hybrid thinking/non-thinking models with really great performance and a plethora of sizes for every kind of hardware. Qwen3-30B-A3B can even run on a CPU at acceptable speeds. Even the tiny 0.6B one is somewhat coherent, which is crazy.