TTS in Godot: advice and current limitations

May 03, 2024 · by Callum Deery (@CallumDeery2)

Godot has access to the platform's text-to-speech (TTS) generators, which is a super useful feature for making games accessible to blind players. There are, however, some limitations to the system and things to be aware of, so here's a little guide to implementing TTS and audio description systems in Godot.

Code Design:

The approach I've found works best is to create a TTS handler node and class which is set up as an autoload (so it can be accessed across scenes). The handler can then hold settings such as voice, pitch, and volume, and provide an interface for the rest of the game to request TTS and stop it when necessary.

This means you can expand the system without having to change functionality in a large number of places across your code. At its most basic the handler will just have functions to request speech and to stop it, but it's likely you’ll want to expand this logic based on the needs of your game and the limitations of the platform API(s).

At minimum you want the handler to accept speech requests at different priorities (whether the text is added to the queue of speech or interrupts the current speech), utilising the underlying API’s queue rather than managing one yourself.

Minimum example:

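class_name TTSManager
extends Node

var enabled: bool = true
var voice: int = 0  # Index into the list of installed English voices
var volume: int = 50  # 0 to 100
var speed: float = 1.0
var vString: String = ""  # Voice ID passed to the platform API

func _ready() -> void:
    # The list is empty if the platform has no English voices installed
    var voices := DisplayServer.tts_get_voices_for_language("en")
    if voice < voices.size():
        vString = voices[voice]

func addText(text: String, interrupt: bool) -> void:
    if not enabled or vString.is_empty():
        return
    if interrupt:
        stop()
    # Arguments: text, voice ID, volume, pitch, rate
    DisplayServer.tts_speak(text, vString, volume, 1.0, speed)

func stop() -> void:
    DisplayServer.tts_stop()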
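
As a rough usage sketch, assuming the handler script above is registered as an autoload under the hypothetical name TTS, any other script can request speech through that global. The function `_on_player_hit` and the spoken strings are made up for illustration:

```gdscript
extends Node

func _ready() -> void:
    # "TTS" is an assumed autoload name for the handler, not something Godot provides
    TTS.addText("Welcome to the main menu", false)

func _on_player_hit() -> void:
    # Urgent information interrupts whatever is currently being read
    TTS.addText("Health low", true)
```

Note that the Text to Speech option under Audio > General in the Project Settings needs to be enabled for the DisplayServer TTS calls to do anything.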

To work around some of the considerations I’ll talk about later, and to support more player control of speech playback (such as replaying previous speech, pausing, etc.), you’ll likely want to handle the speech queue yourself rather than relying on the API queue.

More advanced example:

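var Queue: Array[String] = []
var Played: Array[String] = [] ##You likely want to limit the length of this, for example to the 10 previous strings
var speaking: bool = false ##True while an utterance from the Queue is being read

func _ready():
    ##Hook to the utterance finished callback (add this line to the handler's existing _ready() if it has one)
    DisplayServer.tts_set_utterance_callback(DisplayServer.TTS_UTTERANCE_ENDED, onPreviousSpeechEnded)

func addText(text:String,interrupt:bool):
    if(enabled):
        if(interrupt):
            stop()
        Queue.push_back(text)
        if(not speaking): ##Nothing is playing, so no callback will arrive to start the Queue
            speakNext()

func stop():
    while(Queue.size() > 0): ##Clearing the Queue but allowing it to be played back
        Played.push_back(Queue.pop_front())
    DisplayServer.tts_stop()
    speaking = false

##Speak the oldest queued text, if there is any
func speakNext():
    speaking = Queue.size() > 0
    if(speaking):
        var text = Queue.pop_front()
        DisplayServer.tts_speak(text,vString,volume,1.0,speed)
        Played.push_back(text)

func onPreviousSpeechEnded(_id:int):
    speakNext()

##Respond to player request to replay previous speech
func rewind():
    if(Played.is_empty()):
        return
    DisplayServer.tts_stop()
    Queue.push_front(Played.pop_back()) ##Put the current speech back at the front of the Queue
    if(not Played.is_empty()):
        Queue.push_front(Played.pop_back()) ##The previous speech goes ahead of it
    speakNext()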
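
Pausing and limiting the replay history can be sketched as two further handler functions. This is a fragment meant to sit alongside the Queue and Played arrays above rather than a complete script; `togglePause`, `trimPlayed`, and `MAX_PLAYED` are illustrative names, and the limit of 10 is only the example figure from the comment above:

```gdscript
## Illustrative limit on how many spoken strings are kept for replay
const MAX_PLAYED := 10

## Respond to a player request to pause or resume speech
func togglePause() -> void:
    if DisplayServer.tts_is_paused():
        DisplayServer.tts_resume()
    else:
        DisplayServer.tts_pause()

## Drop the oldest entries once the history grows past the limit
func trimPlayed() -> void:
    while Played.size() > MAX_PLAYED:
        Played.pop_front()
```

Calling `trimPlayed()` after each push onto Played would keep the history bounded.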

Voicing objects

The next step for audio description is enabling objects in the game to be read out. On the simple end this is UI objects reading their text, but can extend to whole environments.

There are two broad ways to approach this, both of which I think are equally reasonable (although the second approach will probably see more use if TTS is added mid-development):

Extension from a “readable” class:

Create a class with a function which calls the TTS handler with an appropriate string of text. This class is then extended for objects which are intended to be readable, and the function is triggered in appropriate conditions (cursor over, etc.).
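
A minimal sketch of this first approach for a UI button, assuming the handler is available as an autoload with the hypothetical name TTS. The class name `ReadableButton` and the `description` property are invented for illustration:

```gdscript
class_name ReadableButton
extends Button

## Extra detail read after the button text (hypothetical property)
@export var description: String = ""

func _ready() -> void:
    # Read the button when it gains keyboard focus or the cursor moves over it
    focus_entered.connect(read)
    mouse_entered.connect(read)

func read() -> void:
    var speech := text
    if not description.is_empty():
        speech += ". " + description
    # "TTS" is an assumed autoload name for the handler
    TTS.addText(speech, true)
```

Other readable objects would follow the same pattern, each building its own string in `read()`.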

Pass to TTShandler:

Within the TTS handler create a function for each readable object, with the handler pulling data from the object to construct the tts string when appropriate.

Cellular City uses the second approach for its map reading, partly because multiple objects (tilemap and buildings) are needed to provide a complete description of levels.

Map reading example:

func readMap(playRegion:Array[Vector2i],map:TileMap,level:LevelManager):
    addText("Describing City",false)
    var speechString = ""
    for y in height+1: ##height, width and topLeft describe the region being read
        for x in width+1:
            var tempVector = topLeft+Vector2i(x,y)
            if(playRegion.has(tempVector)):
                ##Use one of the two methods below, not both
                ##Separate calls reading method: one speech request per tile
                readTile(tempVector,map,level)
                ##Alternative method, reads much faster as a single string sent at the end
                speechString += readTile2(tempVector,map,level) + " "
    addText(speechString,false) ##Only needed for the alternative method

Considerations:

Keyboard support:

For starters make sure to have keyboard navigation available for all parts of the game. This is because mice are typically pretty inaccessible for blind players.

For UI specifically this will involve grabbing focus upon opening a new menu or scene. Focus isn't grabbed by default and there's no way for a player to force it on their own, but once it's grabbed the keyboard can be used to control the menus. There's also the auto-generated focus flow, which can be serviceable but often benefits from hand tuning. (Keyboard controls for UI can be viewed and changed in the Input Map with “Show Built-in Actions” enabled.)
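
Grabbing focus when a menu opens can be sketched roughly as below. The node path is hypothetical and should point at whichever control ought to start focused:

```gdscript
extends Control

## Hypothetical node path; point this at the control that should start focused
@onready var first_control: Control = $VBoxContainer/StartButton

func _ready() -> void:
    visibility_changed.connect(_on_visibility_changed)
    first_control.grab_focus()

func _on_visibility_changed() -> void:
    # Re-grab focus each time the menu is shown again
    if is_visible_in_tree():
        first_control.grab_focus()
```

Hand tuning the focus flow is then a matter of setting the focus neighbour properties on individual controls in the inspector.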

Focus is your friend:

The control nodes which power Godot’s UI system all have a signal which triggers when the node gains “focus”. I’ve found this is a good way to have responsive TTS, with each node requesting speech (and interrupting previous speech) when it gains UI focus.
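
One way to wire this up without editing every button is to connect the `focus_entered` signal from the menu's root node. This is a sketch that assumes the handler is an autoload with the hypothetical name TTS and that only Button nodes (and their subclasses) need reading:

```gdscript
extends Control

func _ready() -> void:
    # Connect every Button below this menu, including nested ones
    for node in find_children("*", "Button", true, false):
        var button := node as Button
        button.focus_entered.connect(_on_button_focused.bind(button))

func _on_button_focused(button: Button) -> void:
    # "TTS" is an assumed autoload name for the handler
    TTS.addText(button.text, true)
```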

Focus problems:

While the above point about focus is true in general, focus can behave in non-ideal ways with more complex control nodes (and unfortunately this isn’t super well documented at the moment).

Examples of elements I’ve run into issues with:

Tab bars: Don’t send a focus event when the current tab is reselected
Item lists: Items within them don’t seem to have proper focus, but this can be worked around with “selected” signals
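
A possible workaround for both cases is to listen to the nodes' own selection signals instead of focus. This sketch assumes Godot 4's `TabBar.tab_selected` and `ItemList.item_selected` signals behave as documented, uses hypothetical node paths, and assumes a handler autoload named TTS:

```gdscript
extends Control

## Hypothetical node paths for illustration
@onready var tabs: TabBar = $TabBar
@onready var items: ItemList = $ItemList

func _ready() -> void:
    tabs.tab_selected.connect(_on_tab_selected)
    items.item_selected.connect(_on_item_selected)

func _on_tab_selected(tab: int) -> void:
    # Read the title of the tab that was just chosen
    TTS.addText(tabs.get_tab_title(tab), true)

func _on_item_selected(index: int) -> void:
    # Read the text of the list item that was just chosen
    TTS.addText(items.get_item_text(index), true)
```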

Variable performance:

This is the big one: because Godot hooks into the platform API rather than having an internal system, there's some pretty large variation in performance (and as the docs point out, Linux might not have support at all unless a user has added it, which might cause problems on the Steam Deck).

This variation includes both the options available (for example Chrome has 4 more voices than Windows by default) and the reaction to multiple requests. I’ve found requesting around 10 short sentences in quick succession causes speech to be paused until the player makes another input (slightly strange behaviour I know), but this seems to be very dependent on what else the player's PC is doing at the time.

A workaround I’ve found helps alleviate this is to gather a set of requests into a longer sentence which is then sent across in one go, but bear in mind this will be read much faster to the player so it might be more difficult to understand.
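
The gathering workaround could look something like the following handler fragment. `batch`, `addToBatch`, and `flushBatch` are illustrative names, and joining with ". " is only a guess at keeping short pauses between sentences, since how punctuation is read depends on the voice:

```gdscript
## Fragment for the TTS handler: collects short requests and sends them as one string
var batch := PackedStringArray()

func addToBatch(text: String) -> void:
    batch.append(text)

func flushBatch(interrupt: bool) -> void:
    if batch.is_empty():
        return
    # One request to the platform instead of many small ones
    addText(". ".join(batch), interrupt)
    batch.clear()
```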

(I’ve not had the chance to test Android or iOS performance.)
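
Given this variation, it may be worth checking what the player's platform offers at startup. A minimal sketch using the DisplayServer feature check and voice list; what to do when TTS is missing (a warning here) is left as an assumption:

```gdscript
extends Node

func _ready() -> void:
    if not DisplayServer.has_feature(DisplayServer.FEATURE_TEXT_TO_SPEECH):
        push_warning("Text to speech is not supported on this platform")
        return
    var voices := DisplayServer.tts_get_voices()
    if voices.is_empty():
        push_warning("No text to speech voices are installed")
        return
    for voice_info in voices:
        # Each entry is a Dictionary with "name", "id" and "language" keys
        print(voice_info["name"], " (", voice_info["language"], ")")
```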

Audio busses:

Because the speech isn’t technically coming from Godot it doesn’t go through the engine’s audio bus system, which might cause issues with levels/priority if you’ve got a game with a complex audioscape.
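
One partial mitigation is to copy a bus level into the TTS volume by hand. This is a sketch of a handler fragment that assumes the handler's `volume` variable from earlier and mirrors the default "Master" bus; it does not route speech through the bus, so effects and ducking still won't apply:

```gdscript
## Fragment for the TTS handler: copy the Master bus level into the TTS volume
func syncVolumeToMasterBus() -> void:
    var bus := AudioServer.get_bus_index("Master")
    if AudioServer.is_bus_mute(bus):
        volume = 0
        return
    var linear := db_to_linear(AudioServer.get_bus_volume_db(bus))
    # tts_speak takes a volume from 0 to 100
    volume = clampi(roundi(linear * 100.0), 0, 100)
```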

Proper nouns and unique words:

Another issue can be systems struggling to voice proper nouns or words unique to your game. I ran into this problem when I generated some character names from the Scottish record of births and deaths, my god did the voice generation struggle. This can also happen if you've got any made-up or uncommon words (sorry to your fantasy magic system).

Giving voice choices can help reduce this, but that is somewhat limited by which voices are on a user's system.
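
Offering that choice can be as simple as letting the player step through the installed voices and hear a sample. A sketch of a handler fragment, assuming the `voice` and `vString` variables from the handler examples; `nextVoice` is an illustrative name and the sample sentence is arbitrary:

```gdscript
## Fragment for the TTS handler: step through the installed English voices
func nextVoice() -> void:
    var voices := DisplayServer.tts_get_voices_for_language("en")
    if voices.is_empty():
        return
    voice = (voice + 1) % voices.size()
    vString = voices[voice]
    # Speak a sample containing a game-specific name so the player can judge it
    addText("Voice %d. Welcome to Cellular City." % (voice + 1), true)
```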

Alternative approaches:

Nightblade’s NVDA plugin

This plugin hooks into NVDA (a common open source screen reader), and uses the system TTS as a fallback. While it's Windows-only at the moment, it can be good for reducing first-time setup for players, as speech can use their already configured setup:

https://godotengine.org/asset-library/asset/2519

Pre-recording

If you've got a relatively constrained amount of speech you can take an alternative approach of pre-recording all the speech you’ll need and then playing that back on request rather than generating on the go (I’m currently recording them for Cellular City). You could even record the TTS system speaking everything you need beforehand if you don’t have a microphone/aren’t confident with audio recording.

I’d still have system TTS available as an option however, as it might be easier for a player to parse and adjust as a fallback.
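
Combining the two might look like the sketch below, which plays a recorded clip when one exists and otherwise falls back to system TTS. The `recordings` dictionary (text to AudioStream), the `prefer_recordings` setting, the child AudioStreamPlayer, and the TTS autoload name are all assumptions for illustration:

```gdscript
extends Node

## Hypothetical mapping from a line of text to its recorded AudioStream
@export var recordings: Dictionary = {}
## Player-facing setting: false means always use system TTS
@export var prefer_recordings: bool = true

@onready var player: AudioStreamPlayer = $AudioStreamPlayer

func say(text: String) -> void:
    if prefer_recordings and recordings.has(text):
        # Stop any system speech so the two do not overlap
        TTS.stop()
        player.stream = recordings[text]
        player.play()
    else:
        player.stop()
        TTS.addText(text, true)
```

Because the recorded clips play through an AudioStreamPlayer, they also pass through the engine's audio bus system, unlike the system TTS.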
