After playing around with it for a bit, I do think the best solution would be to improve the translation models.
I use an open-source voice dictation app built on faster-whisper and the CTranslate2 library, and it's wicked fast: less than 500ms to transcribe and translate more than 30 seconds of recorded speech on an RTX 4070. I'm mentioning this because I saw you were unsure how heavier models would perform. I'm confident there must be a way to try out CUDA-compatible implementations to see how a "heavier" model would actually run.
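For reference, here is a minimal sketch of the kind of GPU-backed setup I mean, using the faster-whisper Python package. The model size, file name, and helper function are my own placeholder choices, not anything from this project:

```python
def join_segments(segments):
    """Concatenate transcribed segment texts into one string."""
    return " ".join(seg.text.strip() for seg in segments)


def main():
    # faster-whisper is an optional, third-party dependency;
    # bail out gracefully if it isn't installed.
    try:
        from faster_whisper import WhisperModel
    except ImportError:
        print("faster-whisper not installed; skipping")
        return

    # device="cuda" with compute_type="float16" is what gives the
    # sub-second latency I saw on an RTX 4070; on machines without
    # CUDA, device="cpu" with compute_type="int8" is a fallback.
    model = WhisperModel("small", device="cuda", compute_type="float16")

    # task="translate" asks Whisper to translate the speech to English
    # instead of just transcribing it. "recording.wav" is a placeholder.
    segments, info = model.transcribe("recording.wav", task="translate")
    print(f"Detected language: {info.language}")
    print(join_segments(segments))


if __name__ == "__main__":
    main()
```

The same CTranslate2 backend that makes this fast should apply to other heavier models too, since CTranslate2 handles the quantized CUDA inference generically.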

