Posted February 26, 2021 by Norsedale
Build #241 is an (almost) transparent update from the client's perspective. It fixes a subtle backend issue that has probably been teasing us in various ways for a while. This might get a little technical, but here it is for those who are interested in the nitty-gritty details:
Imagine that a soldier is attacking an enemy base - this involves (at least) two operations: 1) removing the troop from your base and 2) placing the troop in transit towards the enemy base. If the server goes down right between those two operations, the soldier is in neither of those two places and is essentially lost. Not good.
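To make the failure window concrete, here is a tiny sketch in Python (purely for illustration - the post doesn't say what language the backend is written in, and all names here are made up):

```python
# Hypothetical sketch - names and types are illustrative, not the game's code.
def send_troop(home_base: list, transit: list, soldier: str) -> None:
    home_base.remove(soldier)   # step 1: troop leaves your base
    # <-- if the server dies right here, the soldier exists in neither list
    transit.append(soldier)     # step 2: troop is placed in transit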
To prevent this from happening, the backend uses a system of "periodic snapshots of atomic changes", or "transactions". An "atomic" change means that it either happens completely or not at all. Since transactions are time-consuming, we don't want to perform one for every single little thing that happens in the game, so we take periodic (we're talking milliseconds here) snapshots in which atomic changes to the game's state are committed to persistent storage.
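A rough sketch of that batching idea might look like this (again Python for illustration; every name is invented, and `commit_snapshot()` is the all-or-nothing write sketched a bit further down):

```python
import json

# Hypothetical sketch: in-memory changes are cheap; persistence happens in
# batches on a periodic tick, every few milliseconds.
class SnapshotBatcher:
    def __init__(self, state_path: str):
        self.state: dict = {}
        self.dirty = False
        self.state_path = state_path

    def apply_change(self, key: str, value) -> None:
        self.state[key] = value          # no I/O per game event
        self.dirty = True

    def tick(self) -> None:              # called every few milliseconds
        if self.dirty:
            data = json.dumps(self.state).encode()
            commit_snapshot(data, self.state_path)  # atomic, see below
            self.dirty = False
```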
To make sure this either happens in full or not at all, we rely on an underlying feature supported by most operating systems: atomic moves. It basically means that moving a file from one directory to another either happens completely or not at all - the file is never half-moved. (Copying a file, for example, is not atomic, because the bytes are transferred piece by piece to the new location, so if you interrupt a file copy you end up with half a file.)
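In sketch form the commit path looks something like this (Python for illustration; the post doesn't show the actual backend code). Note the detail in the last line: `shutil.move` only performs an atomic rename when source and destination are on the same filesystem - which becomes important below:

```python
import os, shutil, tempfile

def commit_snapshot(data: bytes, target_path: str) -> None:
    # Sketch of the pattern described above, mirroring the original setup:
    # write the new snapshot to a temp file, then "move" it into place.
    fd, tmp_path = tempfile.mkstemp()    # OS-chosen temp location (!)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
        tmp.flush()
        os.fsync(tmp.fileno())           # make sure the bytes hit the disk
    # Atomic rename *if* tmp_path and target_path share a filesystem;
    # otherwise shutil.move silently falls back to a copy + delete.
    shutil.move(tmp_path, target_path)
```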
Now, this has been working fine for ages - we never had any issues when stress testing, and the public server has also worked fine - but lately we started getting weird notifications of failed transactions and rollbacks due to corrupt files. Upon further inspection the files were fine, but it turned out that the transaction system had only been able to read 16384 bytes of a much larger file. That number is interesting because it's the size of a file chunk - meaning the operating system was not performing an atomic move but rather a copy, and we sometimes ended up reading the file when only the first chunk had been copied (the game code runs orders of magnitude faster than the I/O system, so this race is very likely to happen).
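Schematically, a copy is just a loop moving one chunk at a time, which is exactly why a concurrent reader can catch the destination file when only the first chunk exists (the chunk size below matches the number from our logs; the code itself is illustrative, not what the OS literally runs):

```python
CHUNK_SIZE = 16384  # the chunk size observed in the failed transactions

def chunked_copy(src_path: str, dst_path: str) -> None:
    # Illustrative: roughly what happens under the hood of a file copy.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while chunk := src.read(CHUNK_SIZE):
            dst.write(chunk)  # a reader arriving now sees a partial file
```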
So why was this suddenly happening?
Well, the file that was being moved was first created in a temporary location. To get this location we used the built-in system function for creating a temporary file, because it has the nice benefit that the operating system automatically cleans the file up if, for some reason, we fail to delete it. Turns out that on the cloud service that runs the actual hardware for Deep Rift 9, the default temporary folder is not on the same physical drive as the game. Since atomic moves are implemented by simply re-pointing a directory entry without actually moving any data, they do not (generally) work across physical hardware boundaries - and our atomic move turned into a regular, slow and fragile copy...
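On POSIX systems this boundary is actually visible in code: a raw `rename()` across filesystems refuses with `EXDEV` rather than silently copying (it's the convenience "move" helpers one layer up that paper over that error with a copy). A sketch, with invented names:

```python
import errno, os

def same_filesystem(path_a: str, path_b: str) -> bool:
    # Two paths are on the same filesystem iff their device IDs match;
    # a rename() across filesystems cannot be atomic.
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

def atomic_move(src: str, dst: str) -> None:
    # Unlike a convenience "move" helper, the raw rename never silently
    # degrades to a copy - the kernel refuses with EXDEV instead.
    try:
        os.rename(src, dst)
    except OSError as e:
        if e.errno == errno.EXDEV:
            raise RuntimeError(f"{src} and {dst} are on different "
                               "filesystems; atomic move impossible") from e
        raise
```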
The problem only surfaced now because a specific file had grown larger than the chunk size, but the setup has always caused a completely unnecessary performance hit by doing slow copies instead of fast moves.
The easy fix, and lesson learned, was to make sure temporary files are always created on the same drive as the target file :).
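In sketch form, the fix is essentially one line: create the temp file next to the target, so the final move is guaranteed to be a same-filesystem atomic rename (same illustrative Python as above, not our actual code):

```python
import os, tempfile

def commit_snapshot_fixed(data: bytes, target_path: str) -> None:
    # Fixed sketch: the temp file is created in the target's own directory,
    # so it is guaranteed to live on the same drive/filesystem as the target.
    target_dir = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())          # durable before the rename
        os.replace(tmp_path, target_path)   # same filesystem: truly atomic
    except BaseException:
        os.unlink(tmp_path)                 # clean up the temp on failure
        raise
```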