No problem at all about the delay, and those features as described sound like they'd provide pretty complete coverage of what I'd be hoping to see.
If it does already re-capture the selected area, I have not seen it automatically updating the translation as the text changes. It may be that the interval is longer than the time I have allowed (though I did leave it for over a minute).
My method has been to use Desktop mode and select a region using the default hotkey, L-CTRL. I believe the version I have installed is the latest, 0.3.5, and I do not have any other overlay systems in place that would be likely to interfere.
The system is, however, Windows 11, which I understand has not been thoroughly tested. If there is a way to get lower-level debugging logs, I can experiment and share anything interesting with you, such as whether the capture region drifts or the operation fails for some reason. It's been a while since I last looked into Windows graphics calls, but I'm no stranger to unprocessed streams of verbose data.
Sorry, I just re-read your post and saw that you were clear that it does re-poll the area, but that it doesn't automatically act on the changes. And yeah, I can imagine why it would be tricky to make it smooth.
Keeping the translation panel exactly where the user has positioned it should help (anchoring it to a top-left origin should be intuitive enough). The panel would still need to adjust its width and height as the text changes, though, and, depending on the game, a background animation could trigger a false change in the OCR result. Either of those might lead to unwanted flickering.
Maybe that could be mitigated by diffing the strings to see whether the delta is more than one or two characters (or some configurable number), or by using a Levenshtein edit distance in languages where that makes sense.
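Something along these lines, as a rough Python sketch of the edit-distance idea. I don't know how the tool is structured internally, so the names here (`levenshtein`, `should_retranslate`, `noise_threshold`) are purely hypothetical and not taken from your code; the default threshold of 2 is just the "one or two characters" guess from above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    # Keep the shorter string as the row so the working lists stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free if the chars match)
            ))
        previous = current
    return previous[-1]


def should_retranslate(old_text: str, new_text: str, noise_threshold: int = 2) -> bool:
    """Hypothetical gate: only refresh the translation when the OCR text
    differs from the previous capture by more than `noise_threshold` edits."""
    return levenshtein(old_text, new_text) > noise_threshold
```

With that, a one-character OCR wobble such as "Hello world" becoming "Hel1o world" would be ignored, while a genuinely new line of dialogue would trigger a refresh. The threshold would probably need to be lower (or scaled to string length) for languages where a single character carries a lot of meaning.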
UX isn't really my thing, unfortunately, but I can draft a parsing function or logical flowchart for something if that might be useful to work through an idea.