Indie game storeFree gamesFun gamesHorror games
Game developmentAssetsComics
SalesBundles
Jobs

Strive for Power

Fantasy Slave Management/RPG Erotic Game · By Strive4Power

[Bug] (Linux x86_64) multiple versions - Race condition causes segfault after battle

A topic by ProofByContradiction created Sep 21, 2018 Views: 229 Replies: 5
Viewing posts 1 to 4

Versions Tested: 0.5.19c, 0.5.18d, 0.5.15d, 0.5.13b, ...

The game crashes unexpectedly with a SIGSEGV segmentation fault right before the end of battle. I first noticed this issue in the first game version where battle animations were introduced (don't remember which version). Unfortunately, as this crash seems to happen pretty rarely, it's taken me some time to try to determine what things cause the issue, which was quite difficult as the shipped binary doesn't have any debugging symbols (understandably). I've managed to reverse engineer the executable enough to guess what the issue is though, solely in hopes of this bug report being useful and not an annoying "it doesn't work".

First, some facts:
- Battle animations option can be off or on -- doesn't matter.
- It essentially never happens in early parts of the game, because...
- I think I've only ever seen this with 3 or 4 party members, almost always 4.
- It is unrelated to movement of the mouse cursor.
- The duration of the battle plays no effect.
- The speed at which you click the buttons during battle also has no effect.
- Finally, I tried slowing the game's execution down by 5-10x by having the kernel delay any syscall handling, and it ran without issue for an hour and a half before I gave up (happens after ~5 minutes while running normally).

Every single crash happens due to a memory-related issue while, or in relation to, processing the combat.gd script file. This will either be from a protection fault (due to a bad pointer value), or occasionally an instruction fault (trying to interpret garbage in memory as actual CPU code). Following are some excerpts from the debugger, heavily trimmed & massaged to fit this post.

DUMP 1: (from 0.5.19c)
Thread 1 "Strive.x86_64" received signal SIGSEGV, Segmentation fault.
0x0000000200000000 in ?? ()
=> 0x0000000200000000:  Cannot access memory at address 0x200000000
(gdb) disas
    Dump of assembler code for function _ZN7ClassDB12get_propertyEP6ObjectRK10StringNameR7Variant:
    0x000000000155bad0 <+0>:     push   %r12
    0x000000000155bad2 <+2>:     mov    %rsi,%r12
    0x000000000155bad5 <+5>:     push   %rbp
    0x000000000155bad6 <+6>:     mov    %rdi,%rbp
    0x000000000155bad9 <+9>:     push   %rbx
    0x000000000155bada <+10>:    mov    %rdx,%rbx
    0x000000000155badd <+13>:    sub    $0x60,%rsp
    0x000000000155bae1 <+17>:    mov    %fs:0x28,%rax
    0x000000000155baea <+26>:    mov    %rax,0x58(%rsp)
    0x000000000155baef <+31>:    xor    %eax,%eax
    0x000000000155baf1 <+33>:    mov    0x70(%rdi),%rax
    0x000000000155baf5 <+37>:    test   %rax,%rax
    0x000000000155baf8 <+40>:    jne    0x155bb00 <ClassDB::get_property(Object*, StringName const&, Variant&)+48>
    0x000000000155bafa <+42>:    mov    (%rdi),%rax
    0x000000000155bafd <+45>:    callq  *0x30(%rax)
    => 0x000000000155bb00 <+48>:    mov    0x85fb69(%rip),%rdi        # 0x1dbb670 <ClassDB::classes>
(gdb) info registers
    rax            0xb97d580        0xb97d580
    rbx            0x7fffffffc310   0x7fffffffc310
    rcx            0x7fffffffc590   0x7fffffffc590
    rdx            0x7fffffffc310   0x7fffffffc310
    rsi            0x9d0bf48        0x9d0bf48
    rdi            0xad29480        0xad29480
    rbp            0xad29480        0xad29480
    rsp            0x7fffffffc270   0x7fffffffc270
    r8             0x99d8510        0x99d8510
    r9             0x9d0bf40        0x9d0bf40
    r10            0x1dbb300        0x1dbb300
    r11            0x11     0x11
    r12            0x9d0bf48        0x9d0bf48
    r13            0x9d0bf48        0x9d0bf48
    r14            0x7fffffffc590   0x7fffffffc590
    r15            0x9b0e090        0x9b0e090
    rip            0x155bb00        0x155bb00 <ClassDB::get_property(Object*, StringName const&, Variant&)+48>
(gdb) x/8g $rsi
    0x9d0bf48:      0x0000000002245410      0x0000000000000030
    0x9d0bf58:      0x0000000000040021      0x0000000000040000
    0x9d0bf68:      0x0000a1b300000001      0x000000650000000a
    0x9d0bf78:      0x0000007400000078      0x0000006e00000065
(gdb) x/8g 0x2245410
    0x2245410:      0x00007fff0000000b      0x0000000000000000
    0x2245420:      0x0000000002243ce0      0x9aa1805700000057
    0x2245430:      0x0000000009b11290      0x0000000001fe5060
    0x2245440:      0x0000000000000000      0x0000000000000051
(gdb) x/8g 0x2243ce0
    0x2243ce0:      0x0000006500000072      0x0000007400000063
    0x2243cf0:      0x000000670000005f      0x0000006f0000006c
    0x2243d00:      0x0000006100000062      0x0000005f0000006c
    0x2243d10:      0x0000006f00000070      0x0000006900000073
(gdb) x/s 0x2243ce0
    "rect_global_position"

This happens while processing the chunk of combat.gd that starts 'extends Node\n\nvar period = '. In this case, it's in 'ClassDB::get_property(Object*, StringName const&, Variant&) ()' trying to call the getter method for the child class, but it blows up during the call because the vtable has bad data. This suggests the object was very recently destroyed and that at least the child's destructor had already run.

I've also seen this during execution of 'Variant::evaluate(Variant::Operator const&, Variant const&, Variant const&, Variant&, bool&)+40325', implying a crash occurred while performing some operation on 'rect_global_position'. I think it was either an addition or a floor(), but I wasn't sure.

Finally, I decided to run the game but have the heap's memory reset to known values after alloc & free:

DUMP 2: (from 0.5.19c)
$ valgrind --show-mismatched-frees=no --undef-value-errors=no --malloc-fill=0xbb --free-fill=0xee ./Strive.x86_64
OpenGL ES 3.0 Renderer: NVE7
Boot splash path: res://files/buttons/loading.png
==20622== Invalid read of size 8
==20622==    at 0x15B5A47: Object::get(StringName const&, bool*) const [clone .constprop.10155] (in Strive.x86_64)
==20622==  Address 0x3171c620 is 80 bytes inside a block of size 1,128 free'd
==20622==    at 0x48369EB: free (in vgpreload_memcheck-amd64-linux.so)
==20622==    by 0xAD4966: SceneTree::_flush_delete_queue() (in Strive.x86_64)
==20622==    by 0x1FFEFFF3DF: ???
==20622==    by 0xF84941F: ???
==20622==    by 0x1FFEFFF3DF: ???
==20622==    by 0xB162A7: SceneTree::idle(float) (in Strive.x86_64)
==20622==    by 0x41EFFFFF00000007: ???
==20622==    by 0x205105CF: ???
==20622==    by 0x3E4AAAC100FFFFFF: ???
==20622==    by 0x1FFEFFF307: ???
==20622==  Block was alloc'd at
==20622==    at 0x48357BF: malloc (in vgpreload_memcheck-amd64-linux.so)
==20622==    by 0x158DBD8: Memory::alloc_static(unsigned long, bool) [clone .constprop.10209] (in Strive.x86_64)
==20622==    by 0x1FFEFFE4AF: ???
==20622==    by 0x1FFEFFE4A7: ???
==20622==
==20622== Invalid read of size 8
==20622==    at 0x15B5A6D: Object::get(StringName const&, bool*) const [clone .constprop.10155] (in Strive.x86_64)
==20622==  Address 0xeeeeeeeeeeeeeeee is not stack'd, malloc'd or (recently) free'd
==20622==
==20622== Process terminating with default action of signal 11 (SIGSEGV)
==20622==  General Protection Fault
==20622==    at 0x15B5A6D: Object::get(StringName const&, bool*) const [clone .constprop.10155] (in Strive.x86_64)

The only way that it could be trying to read 0xeeeeeeeeeeeeeeee is by reading after a free(), indicating a race condition. The "rect_global_position" seemed to be godot magic relating to positioning of how something is being drawn? Anyway, I don't really have any experience with how godot works, but one single time I was extremely lucky and actually got a printed error on 0.5.18d:

DUMP3: (from 0.5.18d)
$ ./index.x86_64
OpenGL ES 3.0 Renderer: NVE7
Boot splash path: res://files/buttons/loading.png
ERROR: emit_signal: Error calling method from signal 'meta_hover_ended': 'Panel(statspanel.gd)::_on_traittext_meta_hover_ended': Method expected 1 arguments, but called with 0.
   At: core/object.cpp:1202.
Segmentation fault

I don't know if this means there's an API mismatch, or if it's a misleading error just saying 0 arguments because it found a NULL pointer or something...

Anyway, this is about all I can get without debug symbols (given I don't wish to spend a couple solid weeks decompiling godot's parsing engine. Sorry for not finding the actual needle in this haystack, but I'm hoping I got close enough that you might have an idea of why this is happening.

Sorry, I'm not using linux neither compile it by hand. The linux version is automatically assembled by godot engine and its outside my field of competition.

Okay, I think there are actually two distinct but related problems -- both race conditions (!) -- that are compounding and resulting in "random" crashes during battle. The good news is I *think* I've been able to solve the problem. After my changes to combat.gd, I haven't seen a single crash. That said, I haven't had nearly as much time to test as I had before.

I'll apologize in advance for any inaccuracies in what follows  I had no prior experience with godot script (I work in assembly or C/C++ usually), and the combat script is somewhat spaghettified, but I think I understand it to a moderate degree now.

Rather than processing each side fully before switching to process the other, the relevant handling is broken into a couple coroutines, which can yield until execution jumps back after some engine event occurs. The player's updates happen in the useskills() function, and the enemy's happen in enemyturn(). But there's a dependency here, because enemyturn() itself calls useskills(), possibly multiple times. This could be okay, but for this to be reliable both functions would have to be waiting on the same event. Unfortunately, there are 3-4 events involved within these functions. Since the stats logic is so deeply nested, the same pattern of yields isn't guaranteed to be consistent. Finally, on top of that, one or two of these events gets fired off in response to input or UI updates from their respective threads (full, proper threads, not coroutines). The event notification will signal the coroutine and execute other remaining handler code *while the other coroutine segment is running*. This is the race condition.

This would particularly be an issue for containers that happen to be resized ever, though all global variables would be fine. This would explain basically all the crashes I had earlier. Godot docs suggest explicit mutexes, but I instead chose to ensure better guarding around the problem areas and not let anything run until fully ready.

Fixes 1 and 2 below fix my crash.

I found a second race condition occurring at the beginning of the turn -- first time I've ever seen it not happen at the end of battle. This one's actually a race condition in the true sense -- explicit timers for damage updates are involved. Fix 3 below resolves this problem too.

This code is from 0.5.19c. I've lightly massaged this file so cited line numbers aren't exact. I'll add a .patch file with exact numbers.

combat.gd Fix 1. Guard checks & period update before jumping coroutine locations. Around line 908, turn this

        self.combatlog += '\n' + combatantdictionary(caster, target, text)
        emit_signal("skillplayed")
        endcombatcheck()
        if period == 'win':
                playerwin()
        if period == 'skilluse':
                period = 'base'

into:

        self.combatlog += '\n' + combatantdictionary(caster, target, text)
        endcombatcheck()
        if period == 'win':
                playerwin()
        if period == 'skilluse':
                period = 'base'
        emit_signal("skillplayed")

combat.gd Fix 2. Same reasoning/fix applied but for more-problematic thread notifications Around line 1230, change

func damagein():
        emit_signal("damagetrigger")
        ongoinganimation = false
        yield(get_tree(), 'idle_frame') func tweenfinished():
        yield(get_tree(), 'idle_frame')
        emit_signal("tweenfinished")
        ongoinganimation = false

to:

func damagein():
        yield(get_tree(), 'idle_frame')
        ongoinganimation = false
        emit_signal("damagetrigger") func tweenfinished():
        yield(get_tree(), 'idle_frame')
        ongoinganimation = false
        emit_signal("tweenfinished")

combat.gd Fix 3. Ensure delaydamage always happens immediately, still drawing slower if requested. Perform a full yield to the top modelview handler before actually performing the callback, completing the current frame first. Around line 1264.

        if instantanimation == true:
                for i in timings:
                        timings[i] = 0.05
                timings.delay2 = 0         globals.main.sound('attack')         ongoinganimation = true
        if combatant.group == 'enemy':
                change = -change
        tween.interpolate_property(node, "rect_position", pos, Vector2(pos.x, pos.y-change), timings.speed1, Tween.TRANS_ELASTIC, Tween.EASE_IN_OUT)
        tween.interpolate_callback(self, timings.delaydamage, 'damagein')
        tween.interpolate_property(node, "rect_position", Vector2(pos.x, pos.y-change), pos,  timings.speed2, Tween.TRANS_LINEAR, Tween.EASE_IN_OUT, timings.delay2)
        tween.interpolate_callback(self, timings.delayfinish, 'tweenfinished')
        tween.start()

becomes:

        if instantanimation == true:
                for i in timings:
                        timings[i] = 0.05
                timings.delay2 = 0         timings.delaydamage = 0.0
        globals.main.sound('attack')         ongoinganimation = true
        if combatant.group == 'enemy':
                change = -change
        tween.interpolate_property(node, "rect_position", pos, Vector2(pos.x, pos.y-change), timings.speed1, Tween.TRANS_ELASTIC, Tween.EASE_IN_OUT)
        tween.interpolate_deferred_callback(self, timings.delaydamage, 'damagein')         tween.interpolate_property(node, "rect_position", Vector2(pos.x, pos.y-change), pos,  timings.speed2, Tween.TRANS_LINEAR, Tween.EASE_IN_OUT, timings.delay2)
        tween.interpolate_deferred_callback(self, timings.delayfinish, 'tweenfinished')
        tween.start()

I didn't go through and check I got each possible case. I haven't seen any problems after doing this, though. I definitely ask those interested to help test. I also welcome any comments from the author, whom I'd expect would know this code better than anyone.

Cheers.

Pardon for not responding. I've tested your fixes and they seem to work fine. I'm not a fan of no delay for attack though but otherwise I'll apply it in next patch.

The issue happens in the script processing thread while drawing the panel UI, so this problem isn't Linux-specific and would be present in the Windows build too. It's very similar to this issue: https://itch.io/t/299710/bug-game-crashing-ayneris-sword-double-atk since the crash always happens at the very end of a battle. However, it happens regardless of weapon, even if nobody has a weapon that has a chance for a second attack.

I would guess the issue from the other thread is probably the same issue, and it's actually a UI drawing issue (i.e., trying to reference parts of the panel after it's been destroyed).

Here's a full .patch file to auto-apply the above changes to 0.5.19c code. Probably works fine for 0.5.19d as well if there aren't major changes around those parts of the file. Just run the usual

patch -p0 < whatever_you_saved_it_as.patch

in the game's root and you're good to go.

.patch file (sorry, don't have any better means of sharing this):

--- files/scripts/combat.gd    2018-09-11 13:29:25.000000000 -0700
+++ files/scripts/combat.gd    2018-09-26 00:22:53.290045306 -0700
@@ -905,15 +905,16 @@
     if skill.target == 'all':
         target = targetarray
     
     self.combatlog += '\n' + combatantdictionary(caster, target, text)
-    emit_signal("skillplayed")
     endcombatcheck()
+
     if period == 'win':
         playerwin()      if period == 'skilluse':
         period = 'base'
         
+    emit_signal("skillplayed")
 
 func scripteffect(caster,target,script):
     globals.abilities.call(script, caster, target)
     
@@ -1218,16 +1219,16 @@
 signal defeatfinished
 var ongoinganimation = false
 
 func damagein():
-    emit_signal("damagetrigger")
-    ongoinganimation = false
     yield(get_tree(), 'idle_frame')
+    ongoinganimation = false
+    emit_signal("damagetrigger")
 
 func tweenfinished():
     yield(get_tree(), 'idle_frame')
-    emit_signal("tweenfinished")
     ongoinganimation = false
+    emit_signal("tweenfinished")
 
 
 func defeatfinished():
     emit_signal("defeatfinished")
@@ -1244,18 +1245,21 @@
         for i in timings:
             timings[i] = 0.05
         timings.delay2 = 0
     
+    timings.delaydamage = 0.0
+
     globals.main.sound('attack')
     
     
     ongoinganimation = true
     if combatant.group == 'enemy':
         change = -change
     tween.interpolate_property(node, "rect_position", pos, Vector2(pos.x, pos.y-change), timings.speed1, Tween.TRANS_ELASTIC, Tween.EASE_IN_OUT)
-    tween.interpolate_callback(self, timings.delaydamage, 'damagein')
+    tween.interpolate_deferred_callback(self, timings.delaydamage, 'damagein')
+
     tween.interpolate_property(node, "rect_position", Vector2(pos.x, pos.y-change), pos,  timings.speed2, Tween.TRANS_LINEAR, Tween.EASE_IN_OUT, timings.delay2)
-    tween.interpolate_callback(self, timings.delayfinish, 'tweenfinished')
+    tween.interpolate_deferred_callback(self, timings.delayfinish, 'tweenfinished')
     tween.start()
 
 func slamanimation(combatant):
     var tween = $Tween