Given the types of questions and answers, I'm not sure why you needed to use an LLM instead of a traditional text parser. I'd imagine an LLM would be better at marking answers correct when they get the gist right, but I had the same issue as the other commenter: the inputs were extremely picky about wording/phrasing. (For example, it didn't accept 'scary movies' for one puzzle or 'they have to/are forced to' for another, despite those being basically the answer.)
If it's going to be that picky, there's no reason not to use a traditional parser.