Post by DaveSan in HammerAI | Free & unlimited AI chat running on your computer comments

I've recently been interested in setting up Llama models on Windows, using KoboldAI and SillyTavern as the medium to hook into them. The main thing I found is that fully offloading the model into your VRAM is an insane boost in performance and the sort of 'end goal' you want to have when loading models. Having lots of system RAM is a bit of a red herring, since RAM speeds are meh in comparison. Even if it's just 1 layer you'll feel the difference.

For 7B models, you'd need at minimum 6 GB of VRAM. And even then, you'd probably need a model file of about 3 GB, because not all 7B models are born equal.
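To make that rule of thumb concrete, here is a rough Python sketch of how one might estimate how many layers fit in VRAM. It assumes all layers are the same size and that a fixed chunk of VRAM (the hypothetical `overhead_gb`, a guess of 1.5 GB here) is reserved for context and scratch buffers. The function name and the default are made up for illustration and are not taken from KoboldAI or any other tool.

```python
def layers_that_fit(vram_gb, model_file_gb, n_layers, overhead_gb=1.5):
    """Estimate how many layers can be offloaded to the GPU.

    Assumptions (illustrative only): every layer is the same size, and
    overhead_gb of VRAM is reserved for context and scratch buffers.
    """
    usable_gb = vram_gb - overhead_gb
    if usable_gb <= 0:
        return 0
    per_layer_gb = model_file_gb / n_layers
    # Never report more layers than the model actually has.
    return min(n_layers, int(usable_gb // per_layer_gb))


# A 7B Llama model has 32 transformer layers.
print(layers_that_fit(vram_gb=6, model_file_gb=3, n_layers=32))  # full offload
print(layers_that_fit(vram_gb=6, model_file_gb=5, n_layers=32))  # partial offload
```

Under these assumptions a 3 GB file fits entirely on a 6 GB card, while a 5 GB file does not. The real overhead depends on context size and backend, so treat the number as a starting point and adjust from there.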

So is this something your software will do automatically, in terms of figuring out context sizes, layers, BLAS batch sizes, etc., to determine the optimal loadout for the best speeds? Are different quantized versions of models available depending on VRAM availability?
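As a sketch of what that kind of automatic selection could look like (not how HammerAI or any other tool actually does it), the snippet below picks the largest quantized file that still fits fully in VRAM. The file sizes in `EXAMPLE_7B` are rough illustrative figures rather than measured values, and `pick_quant` and `overhead_gb` are hypothetical names.

```python
def pick_quant(vram_gb, candidates, overhead_gb=1.5):
    """Return the name of the largest candidate that fits in VRAM, or None.

    candidates maps a quantization name to its file size in GB.
    overhead_gb is an assumed reserve for context and scratch buffers.
    """
    budget_gb = vram_gb - overhead_gb
    fitting = [(size, name) for name, size in candidates.items() if size <= budget_gb]
    return max(fitting)[1] if fitting else None


# Illustrative sizes in GB for a 7B model; not measured values.
EXAMPLE_7B = {"Q3_K_M": 3.3, "Q4_K_M": 4.1, "Q5_K_M": 4.8, "Q8_0": 7.2}

print(pick_quant(6, EXAMPLE_7B))
print(pick_quant(8, EXAMPLE_7B))
print(pick_quant(4, EXAMPLE_7B))
```

With these example numbers a 6 GB card would get the Q4_K_M file and an 8 GB card the Q5_K_M file. A 4 GB card gets `None`, meaning nothing fits fully and partial offloading would be needed.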

HammerAI 2 years ago (+1)

Hey! That's a good question and seems like something we should support, though right now we just have some hard-coded presets. Happy to chat more about it in our Discord if you're interested: https://discord.gg/kXuK7m7aa9
