Does your GPU have 24+ GB of VRAM?
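If you're not sure, a quick check looks something like the sketch below. It assumes an NVIDIA card with nvidia-smi on the PATH; on AMD, `rocm-smi --showmeminfo vram` should report the same information.

```python
# Minimal sketch: query total VRAM via nvidia-smi (NVIDIA GPUs only).
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
)
for i, line in enumerate(out.stdout.strip().splitlines()):
    total_gb = int(line.strip()) / 1024  # nvidia-smi reports MiB
    print(f"GPU {i}: {total_gb:.1f} GB VRAM")
```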
There also seems to be an unfixed backend issue with Vulkan on AMD GPUs for this specific model. I believe the workaround currently used upstream can make context processing so slow that the API call times out, but I have no way of testing or fixing this myself.
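If you do hit that timeout, one untested mitigation is simply raising the client-side timeout when calling the server. A minimal sketch, assuming an OpenAI-compatible endpoint on localhost:8080 and the `requests` library; the URL, port, and model name are placeholders for your own setup:

```python
# Minimal sketch: raise the client-side timeout so slow prompt processing
# on the Vulkan/AMD path doesn't abort the request before it finishes.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed local endpoint
    json={
        "model": "glm-4",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=600,  # 10 minutes; default client timeouts are often much shorter
)
print(resp.json()["choices"][0]["message"]["content"])
```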
After testing the model on cloud-based 3090s/5090s, I've come to the conclusion that the upstream GPU implementation for this model is completely broken.
This didn't come up during my own testing because I don't own a GPU with 24 GB of VRAM, so all of my testing was done with slow CPU inference only.
I will look for an alternative 24 GB model and add GLM-4 back in once it's properly implemented upstream.