Thanks for the feedback! So right now our Ollama version is actually really old. Does it work better if you use this beta version? It updates Ollama and should be MUCH faster: https://github.com/hammer-ai/hammerai/releases/tag/v0.0.206

Yes and no. On the one hand, when the models do run, they run better; on the other hand, it's still impossible to run the model set as the character model and I have to use the custom model option. The good news is that it does run faster, and at least with custom models I haven't encountered the issues I had before where it just wouldn't run.


BTW, did you put the Linux AMD ROCm support in just for me as your one known Linux user, or is HammerAI actually detecting that I have an AMD AI-capable CPU with an iGPU besides the Nvidia GPU it's using right now? Because if it's the latter, that's actually impressive - ROCm support is so spotty on Linux it might as well not be there. The AMD 780M is a lot weaker than the 4060, so I don't think it will see much usage, but I might try bigger models just to see how it behaves if Hammer can actually use the AMD iGPU natively.

PS. Please add a few newer RP models. Some of your competitors have finetunes under open source licenses: ArliAI, Latitudegames, Dreamgen. Please add a few newer Nemo finetunes. Also, and this is just up to you, consider IQ quants and not offloading the KV cache to VRAM. IQ quants can make 8 GB of VRAM fully enough to 100% offload most non-Gemma 12B models, as long as one doesn't also try to offload the KV cache to VRAM. That's in case you're not doing this already.


Anyways, cheers and thanks for the new Ollama update, it did in fact help.

Glad to hear it's better! I really need to get this update out to users, but there is one bug I know about that I need to fix before I can launch.

Linux AMD ROCm support was just in case I ever had any Linux users - I wanted to make sure it was awesome for them! Glad to hear that day came faster than I expected.

Will definitely add some more models, I'm pretty behind. Any specific suggestions? Would love to hear what the best stuff is nowadays.

I will learn more about IQ quants and the KV cache offloading. Is that suggestion for the local LLMs, or the cloud-hosted ones?

Anyways, happy it's better. If you want to chat more, I'm hammer_ai on Discord - would be fun to chat more about finetunes to add / any other suggestions you have.

For the desktop version. Basically, there are two ways to use the AI: one where the KV cache is built up gradually with each new prompt and reply, and one where the user sends the entire conversation back to the AI to be reprocessed with each new prompt to get a new response. On desktop it's faster to use the KV cache than it is to reprocess the entire conversation again and again. The thing is, the KV cache can be kept separate from the rest of the model. If the offload-to-VRAM option is used and there is enough VRAM, it's always faster, BUT if there isn't enough VRAM for the desired KV cache size, then part of the model and part of the KV cache end up in VRAM with the rest in RAM, and that is always slower.

If you can fit everything in VRAM you're at, say, 21 tokens per second; with only the model in VRAM and the KV cache in RAM you'd be at around 15; and with part of the model and part of the KV cache in VRAM and the rest in RAM you could go as low as 10 or even 5 tokens per second. So if you can't fit everything in VRAM, it's always preferable to load only the model into VRAM and leave the KV cache in RAM. For the website, since you're only using a 4k context window, as long as everything fits in VRAM I wouldn't touch it - if it ain't broke don't fix it and whatnot. But on desktop, letting us keep the KV cache in RAM instead of offloading it to VRAM can significantly increase performance.
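To picture the knob I'm talking about, here's a minimal sketch using llama-cpp-python rather than your actual Ollama setup (I believe llama.cpp itself exposes the same thing as a --no-kv-offload flag, but don't quote me on the exact spelling). The model path, context size, and prompt are just placeholders I made up:

```python
# Sketch only: assumes a llama-cpp-python backend, not HammerAI's real Ollama setup.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-12b-rp-model.IQ4_XS.gguf",  # hypothetical local GGUF
    n_gpu_layers=-1,     # offload every model layer to the GPU
    n_ctx=8192,          # context window; the KV cache grows with this
    offload_kqv=False,   # keep the KV cache in system RAM instead of VRAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Describe the tavern we just entered."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

The point is just that "all layers on the GPU, KV cache in RAM" is a single switch, and it's the configuration that avoids the worst case where the model and the cache both spill between VRAM and RAM.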


As for recommended models: I'd say move the Nous models to Hermes 3 (non-thinking), look into the ArliAI RPMax v1.3 series (4 models at 4 sizes, based on 3 different bases: Llama 3.1, Qwen 2.5 and Mistral Nemo), and the latest Latitudegames models. I'm using Wayfarer 12B for RP and Muse 12B for story writing (both Latitudegames models), but they have larger models too, and again, all open source and on Huggingface. Dreamgen is also doing interesting stuff, but their older stuff is, well, older, and the new model - Lucid - is still in beta and fairly bad at following instructions.

But yeah, try Wayfarer; at least for me it's significantly superior to the Drummer Rocinante you have as a default option. I get actual RP responses from it, while Rocinante 12B wants to just continue my own posts 90% of the time. Also, I'd probably remove the thinking models from the default options. Honestly, most people are not going to have the kind of hardware to run them at high enough speeds to make the thinking steps worth it - at least not on desktop. Especially the smaller models, which even unquantized can still catch themselves in an infinite thinking loop.

Overall, I'd try to find finetunes and test them if I were you. What I recommended is what I tested and found to be an improvement over what came before. I'd stay away from merges, ablated and uncensored models. Just try to find RP and story finetunes that are open source and on Huggingface to test. Also, and you did not hear this from me, try IBM's Granite 3.3 8B model... for a model designed for office work and instruction-trained to be harmless and safe, boy does it follow NSFW instructions well. And I do mean NSFW. And it's Apache 2.0 :)


As for IQ quants, they can offer output quality similar to the K quants at smaller sizes - but they are only similarly fast under ROCm and CUDA; there are significant slowdowns under Vulkan and on CPU. I know Ollama supports them, though I don't think you can download an IQ quant from their site directly. An IQ4_XS should be very similar to a Q4_K_S in output - within margin of error for RP and story purposes - but substantially smaller.
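If you want to sanity-check the size difference yourself, here's a rough sketch that just pulls the two quants of the same model from Hugging Face and compares them on disk. The repo id and filenames are placeholders, not a real repo I'm pointing you at:

```python
# Sketch: compare the on-disk size of a Q4_K_S vs an IQ4_XS quant of the same model.
# The repo id and filenames below are hypothetical placeholders.
import os
from huggingface_hub import hf_hub_download

REPO = "some-org/some-12b-rp-model-GGUF"          # placeholder repo id
QUANTS = ["model.Q4_K_S.gguf", "model.IQ4_XS.gguf"]  # placeholder filenames

for filename in QUANTS:
    path = hf_hub_download(repo_id=REPO, filename=filename)
    size_gb = os.path.getsize(path) / 1e9
    print(f"{filename}: {size_gb:.2f} GB")  # the IQ4_XS is typically the smaller file
```

That saved gigabyte or so is exactly what lets a 12B model plus its layers fit entirely in 8 GB of VRAM when the KV cache stays in RAM.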