2025.7 Global Test: Adjustment Summary
After deploying the system to AWS, we noticed an abnormal increase in memory usage within the first few minutes—an issue that didn’t occur during previous tests.
Our initial suspicion pointed to the underlying network layer's memory handling for packets. Originally, the system was designed for internal use, and application-level developers were required to manually split large packets into fixed sizes before transmission. As we prepared the system for external developers, we re-evaluated this design and judged the manual splitting unfriendly. We revised the behavior to support packets of arbitrary size: when a packet exceeds the preset size, the system dynamically allocates extra memory and releases it immediately after sending.
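A minimal sketch of that revised send path, assuming a fixed preset buffer size; the names (PRESET_SIZE, send_packet, the transport's send method) are illustrative, not the actual implementation:

```python
# Hypothetical sketch of the revised behavior: packets within the preset
# size use the preallocated buffer; oversized packets get a temporary
# allocation that is dropped immediately after sending.
PRESET_SIZE = 4096  # assumed fixed buffer size in the network layer

def send_packet(transport, payload: bytes) -> None:
    if len(payload) <= PRESET_SIZE:
        # Fits the preallocated buffer: no extra allocation needed.
        transport.send(payload)
    else:
        # Oversized packet: allocate extra memory dynamically, send,
        # then release the reference right away so it can be reclaimed.
        extra = bytes(payload)
        transport.send(extra)
        del extra
```

Frequent allocate/free cycles like this are exactly the pattern that can fragment the heap, which is what motivated the jemalloc experiment below.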
We then tried introducing jemalloc to mitigate fragmentation. However, memory usage increased at nearly double the original rate. After two consistent test runs, we concluded this approach was ineffective.
During further debugging, we considered another possibility: insufficient compute resources leading to event backlogs and memory buildup.
Assuming one character requires one unit of compute power, the system needed approximately 300,000 units. At the time, only 10 logic servers were running, meaning each server had to handle the logic for ~30,000 characters—likely exceeding the capacity of a single core.
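The back-of-envelope arithmetic above, written out (the one-character-per-unit assumption is the rough model from the text, not a measured figure):

```python
# Rough capacity model: one character ~ one unit of compute.
total_units = 300_000       # ~300,000 characters in the test
servers_before = 10         # logic servers running at the time
per_server_before = total_units // servers_before  # load per server

servers_after = servers_before + 4  # after adding 4 logic servers
per_server_after = total_units // servers_after

print(per_server_before)  # 30000 characters per server before
print(per_server_after)   # 21428 characters per server after
```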
We added 4 more logic servers and observed the system. It has now been running stably for over 30 minutes with memory usage below 2%, suggesting that the root cause was a compute bottleneck.
Initially, we were misled by top, which showed overall CPU usage below 40%, giving the false impression that resources were sufficient. What we missed was that each host only ran 1 logic unit and 2 network units. When the logic unit was overloaded and the network units idle, overall CPU usage did not reflect the true bottleneck.
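A toy illustration of how the host-wide average hid the bottleneck, assuming a 4-vCPU host (c7i.xlarge) and the 1-logic/2-network unit mix described above; the specific percentages are made up for illustration:

```python
# One saturated logic unit plus two near-idle network units on a
# 4-vCPU host: the host-wide average stays "healthy" even though
# the logic unit is completely maxed out.
logic_cpu = [100.0]        # one logic unit, pegged at 100% of a core
network_cpu = [5.0, 5.0]   # two network units, mostly idle
vcpus = 4                  # c7i.xlarge has 4 vCPUs; the 4th core idles

host_avg = (sum(logic_cpu) + sum(network_cpu)) / vcpus
print(host_avg)  # 27.5 -- "below 40%" despite a saturated logic unit
```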
This experience highlighted the need for per-unit CPU usage metrics going forward. This will help identify bottlenecks accurately and trigger appropriate alerts. Under high load, pairing each logic unit with a dedicated network unit may provide better hardware utilization.
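One way to get such per-unit metrics on Linux, assuming each unit runs as its own process, is to sample utime+stime from /proc/<pid>/stat; this is a minimal sketch, not our actual monitoring code, and the field offsets follow the proc(5) layout:

```python
# Sketch: per-process CPU usage sampled from /proc/<pid>/stat.
import os
import time

def cpu_ticks(pid: int) -> int:
    """Return utime + stime (clock ticks) for a process."""
    with open(f"/proc/{pid}/stat") as f:
        # Split after the "(comm)" field so spaces in the process
        # name cannot shift the field offsets.
        fields = f.read().rsplit(") ", 1)[1].split()
    # fields[11] and fields[12] are utime and stime (fields 14 and 15
    # in proc(5); the first two fields were consumed by the split).
    return int(fields[11]) + int(fields[12])

def cpu_percent(pid: int, interval: float = 1.0) -> float:
    """Sample CPU usage of one unit over `interval` seconds."""
    hz = os.sysconf("SC_CLK_TCK")  # ticks per second
    before = cpu_ticks(pid)
    time.sleep(interval)
    after = cpu_ticks(pid)
    return (after - before) / hz / interval * 100.0
```

A per-unit reading like this would have shown the logic unit near 100% while the network units idled, and is the kind of signal an alert threshold could be attached to.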
In the end, we ran the system on 14 c7i.xlarge instances.