
Glad to hear it's better! I really need to get this update out to users. But there is one bug I know about that I need to fix before I can launch.

Linux AMD ROCm support was added just in case I ever had any users who needed it - I wanted to make sure it was awesome for them! Glad to hear that day came sooner than I expected.

Will definitely add some more models, I'm pretty behind. Any specific suggestions? Would love to hear what the best stuff is nowadays.

I will learn more about IQ quants and the KV cache offloading. Is that suggestion for the local LLMs, or the cloud-hosted ones?

Anyways, happy it's better. If you want to chat more, I'm hammer_ai on Discord - would be fun to chat more about finetunes to add / any other suggestions you have.


For the desktop version. Basically, there are two ways to use AI: one where the KV cache is built up gradually with each new prompt and reply, and one where the user sends the entire conversation back to the AI to be reprocessed with each new prompt to get a new response. On desktop it's faster to use the KV cache than it is to reprocess the entire conversation again and again. Thing is, the KV cache can be kept separate from the rest of the model. If the offload-to-VRAM option is used and there is enough VRAM, it's always faster, BUT if there isn't enough VRAM for the desired KV cache size then part of the model and part of the KV cache end up in VRAM with the rest in RAM, and this is always slower.

If you can fit everything in VRAM you're at, say, 21 tokens per second; with only the model in VRAM and the KV cache in RAM you'd be at around 15; and with part of the model and part of the KV cache in VRAM and the rest in RAM you could go as low as 10 or even 5 tokens per second. So if you can't fit everything in VRAM, it's always preferable to load only the model into VRAM and leave the KV cache in RAM. For the website, since you're only using a 4k context window and everything fits in VRAM, I wouldn't touch it - if it ain't broke don't fix it and whatnot. But on desktop, letting us choose whether to keep the KV cache in RAM or offload it to VRAM can significantly increase performance.
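If it helps, here's a rough sketch of what that toggle could look like with llama-cpp-python (just assuming that's more or less what the desktop app wraps under the hood; the parameter names below are llama-cpp-python's, not yours, and the model file name is made up):

```python
from llama_cpp import Llama

# Case 1: everything fits in VRAM - offload all layers AND the KV cache (fastest).
llm_full_gpu = Llama(
    model_path="model-Q4_K_S.gguf",
    n_gpu_layers=-1,      # all layers on the GPU
    offload_kqv=True,     # KV cache lives in VRAM too
    n_ctx=8192,
)

# Case 2: VRAM is tight - keep the whole model on the GPU but leave the KV cache
# in system RAM, instead of spilling half the model and half the cache into RAM.
llm_kv_in_ram = Llama(
    model_path="model-Q4_K_S.gguf",
    n_gpu_layers=-1,
    offload_kqv=False,    # KV cache stays in system RAM
    n_ctx=8192,
)
```

Exposing that one boolean in the settings would cover the "model in VRAM, cache in RAM" middle case I described above.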


As for recommended models: I'd say move the Nous models to Hermes 3 (non-thinking), look into the ArliAI RPMax v1.3 series (four models at four sizes, based on three different bases: Llama 3.1, Qwen 2.5, and Mistral Nemo), and the latest LatitudeGames models. I'm using Wayfarer 12B for RP and Muse 12B for story writing (the LatitudeGames models), but they have larger models too, and again, all open source and on Hugging Face. DreamGen is also doing interesting stuff, but their older stuff is, well, older, and the new model - Lucid - is still in beta and fairly bad at following instructions.

But yeah, try Wayfarer; at least for me it's significantly superior to the Drummer Rocinante you have as a default option. I get actual RP responses from it, while Rocinante 12B wants to just continue my own posts 90% of the time. Also, I'd probably remove the thinking models from the default options. Honestly, most people are not going to have the kind of hardware to run them at high enough speeds to make the thinking steps worth it - at least not on desktop. Especially the smaller models, which even unquantized can still get caught in an infinite thinking loop.

Overall, I'd try to find finetunes and test them if I were you. What I recommended is what I tested and found to be an improvement over what came before. I'd stay away from merges, abliterated, and uncensored models. Just try to find RP and story finetunes that are open source and on Hugging Face to test. Also, and you did not hear this from me, try IBM's Granite 3.3 8B model... for a model designed for office work and instruction-tuned to be harmless and safe, boy does it follow NSFW instructions well. And I do mean NSFW. And it's Apache 2.0 :)


As for IQ quants, they can offer similar quality to the regular K quants at smaller sizes - but they're only similarly fast under ROCm and CUDA; there are significant slowdowns under Vulkan and on CPU. I know Ollama supports them, though I don't think you can download an IQ quant from their site directly. An IQ4_XS should be very similar to a Q4_K_S in output - within margin of error for RP and story purposes - but substantially smaller.
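A quick way to sanity-check that claim yourself, again assuming llama-cpp-python (the file names here are hypothetical - just whatever Q4_K_S and IQ4_XS GGUFs of the same finetune you grab from Hugging Face):

```python
import os
from llama_cpp import Llama

# Same finetune, two quants: compare on-disk size and eyeball the output quality.
for path in ("Wayfarer-12B-Q4_K_S.gguf", "Wayfarer-12B-IQ4_XS.gguf"):
    print(path, round(os.path.getsize(path) / 2**30, 2), "GiB")
    llm = Llama(model_path=path, n_gpu_layers=-1, n_ctx=4096)
    out = llm("You are the narrator. The party enters the ruined keep.", max_tokens=64)
    print(out["choices"][0]["text"])
```

For RP and story use the two outputs should read about the same, with the IQ4_XS file a good chunk smaller - just don't expect the same speed if you're on Vulkan or pure CPU.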