butler is itch.io's command-line upload tool. It lets game developers push new builds of their games quickly and reliably, and generates small patches. This enables players using the itch.io app to keep their games up-to-date with small downloads.
All itch.io developers are encouraged to use it!
Short version: users have been reporting decreased upload performance when using butler from Europe to push their itch.io builds. The 0.18.0 version solves these problems fully and everyone is encouraged to upgrade to it as soon as possible (see Upgrade instructions).
In the interest of transparency, here's a timeline of events, background on what happened and how it was resolved:
- On August 17, 2016, we got the first report of decreased performance and increased error rates. The problem was seemingly random and we had no leads on what caused it yet.
- In the next few days, other users reported having the same problem: HTTP 503 errors that interrupted the upload, or HTTP 404 after the upload finished.
- On August 19, we released a new version of butler that retried on 503 errors. This appeared to fix 503 errors, but the 404 errors were still there, which seemed to indicate that the upload was carried to completion, but the file was not present in remote storage afterwards.
- On the same day, we realized that all of the users having the problem had something in common: they were uploading from Europe. US-based users didn't seem affected by the problem. This suggested a problem not with butler, but with Google Cloud Storage instead.
- In the meantime, we reached out to Google and started to make our way through their support system (I'd like to thank in particular our contact at Firebase, who helped us get to the right people quickly!).
- On August 23, Google told us they had started investigating the issue internally — their engineers started looking at upload performance and error rates based on our reports.
- On August 27, a Google developer started pointing out errors in butler's error handling and retry logic. He helped us clarify what the Google Cloud Storage docs meant (and filed a bug internally to improve these).
- After fixing butler's error handling, a new version was released, which would always successfully complete uploads, although sometimes with several retries per block, and still much slower than before August 17.
- With those errors fixed, Google engineers once again started looking at error rates and performance when uploading from the EU (since client-side errors were eliminated).
- On August 29, Google acknowledged that there could be an issue in their internal EU->US networking. The way we were doing uploads forced them to transfer blocks from EU to the US, and if that wasn't completed in a timely manner, the transfer would fail (due to an internally set deadline).
- On the same day, the same Google developer pointed out that the way we handled resumable uploads in butler (by generating the upload session on our US-based server, then uploading from wherever butler was running from) was unusual, and suggested an alternative approach.
- On August 31, we implemented the alternative approach suggested by Google, and observed pre-issue performance levels and zero errors.
- Today, September 2, the changes are fully deployed to our backend, and butler 0.18.0 fully takes advantage of the new approach, meaning faster upload speeds for everyone (even faster than before the problem appeared, since it's now talking to a node in the same continent).
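For the curious, here is a rough idea of what "retrying per block" looks like. This is a minimal Python sketch, not butler's actual code: the function name, the set of retryable status codes, the attempt limit and the backoff delays are all assumptions picked for illustration, and `send_block` is a hypothetical callable that uploads one block and returns an HTTP-like status code.

```python
import random
import time

# Assumption: server-side errors like 503 are treated as transient and retried.
RETRYABLE_STATUSES = {500, 502, 503, 504}


class UploadError(Exception):
    """Raised when a block cannot be uploaded."""


def upload_block_with_retry(send_block, block, max_attempts=5,
                            base_delay=0.5, sleep=time.sleep):
    """Upload one block, retrying transient errors with exponential backoff.

    send_block is a hypothetical callable: it takes the block's bytes and
    returns a status code, where 2xx means the block was accepted.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        status = send_block(block)
        if 200 <= status < 300:
            return attempt
        if status not in RETRYABLE_STATUSES:
            # Errors such as 404 will not go away by retrying the same block.
            raise UploadError(f"non-retryable status {status}")
        if attempt < max_attempts:
            # Wait 0.5s, 1s, 2s, ... plus some jitter before the next attempt.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise UploadError(f"giving up after {max_attempts} attempts")


if __name__ == "__main__":
    # Simulate a block that fails twice with 503 before going through.
    responses = iter([503, 503, 200])
    attempts = upload_block_with_retry(lambda b: next(responses), b"data",
                                       sleep=lambda seconds: None)
    print("block uploaded after", attempts, "attempts")
```

The important part is the distinction between errors worth retrying (like the 503s we were seeing) and errors that aren't: retrying only helps with the former, which is why the 404s stuck around after our first fix.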
Everyone is encouraged to upgrade as soon as possible, even US users, as improved error handling benefits everyone, in the rare cases when uploading blocks still fails.
While it took longer than we're comfortable with, we're happy that the issue was resolved successfully and that butler's error handling is more solid than ever. Now that we have the right contacts at Google, we feel confident that we'll be able to resolve any upcoming issues in a more timely fashion!
Finally, I'd like to thank itch.io users for their quality reports and understanding throughout this issue. It's a pleasure working for you all :)