GLM-4 seems to give up when I try to talk to NPCs. It just hangs forever, then an LLM error or something happens.
After testing the model on cloud-based 3090s/5090s, I've concluded that the upstream GPU implementation for this model is completely broken.
This didn't come up during my own testing because I don't own a GPU with 24 GB of VRAM, so all of my testing used slow CPU-only inference.
I will look for an alternative 24 GB model and add GLM-4 back in once it's properly implemented upstream.
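
If anyone wants to check this on their own hardware, here's a minimal sketch of the CPU-vs-GPU comparison I'm describing. It assumes a llama.cpp-style backend driven through llama-cpp-python; the backend choice and the model filename are my assumptions for illustration, not necessarily the project's actual setup. The point is that the same model with zero GPU layers behaves, while GPU offload is where it falls over.

```python
# Hypothetical repro sketch, assuming a llama.cpp-based backend via
# llama-cpp-python; "glm-4.gguf" is a placeholder filename.
from llama_cpp import Llama

PROMPT = "Hello, traveler."

# CPU-only inference (n_gpu_layers=0): the configuration I tested with,
# which works, just slowly.
cpu_llm = Llama(model_path="glm-4.gguf", n_gpu_layers=0, verbose=False)
print(cpu_llm(PROMPT, max_tokens=32)["choices"][0]["text"])

# Full GPU offload (n_gpu_layers=-1): the path that hangs or errors out
# on the cloud 3090s/5090s.
gpu_llm = Llama(model_path="glm-4.gguf", n_gpu_layers=-1, verbose=False)
print(gpu_llm(PROMPT, max_tokens=32)["choices"][0]["text"])
```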