Thanks for the feedback. I can provide development builds in future with more detailed stats if you're interested in helping measure performance improvements on real hardware. It might make sense to set up a Discord group to do this.
I tend to remove the granular measurements from the release builds themselves, as the extra measurement code probably adds a bit of bloat and slowdown of its own.
The 040 build uses a faster C2P (chunky-to-planar) algorithm for that architecture, and makes limited use of the 040's move16 instruction to move data around.
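For anyone unfamiliar with C2P: the game renders into a "chunky" buffer (one byte per pixel), but the Amiga displays bitplanes, so every frame has to be converted. Here is a minimal sketch in C of the naive mapping, just to show what the conversion does. It is not the optimised 040 routine from the build; the function name c2p_naive and its parameters are made up for illustration, and it assumes the pixel count is a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Naive chunky-to-planar conversion (illustrative only, hypothetical name).
 *
 * chunky     : source buffer, one byte per pixel (colour index)
 * num_pixels : number of pixels to convert, assumed to be a multiple of 8
 * planes     : array of 'depth' pointers, one per destination bitplane
 * depth      : number of bitplanes (1..8)
 *
 * Bit p of each pixel goes to plane p. Within each plane byte, the
 * leftmost pixel lands in the most significant bit.
 */
void c2p_naive(const uint8_t *chunky, size_t num_pixels,
               uint8_t *planes[], int depth)
{
    for (size_t x = 0; x < num_pixels; x += 8) {
        for (int p = 0; p < depth; p++) {
            uint8_t out = 0;
            /* Gather bit p from 8 consecutive pixels into one plane byte. */
            for (int i = 0; i < 8; i++) {
                out = (uint8_t)((out << 1) | ((chunky[x + i] >> p) & 1));
            }
            planes[p][x / 8] = out;
        }
    }
}
```

Real C2P routines avoid this bit-at-a-time loop by working on several longwords at once with merge/swap passes, which is where the per-architecture tuning comes in.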
I reworked some of the original sub/road CPU code to be more register efficient, with less reliance on memory access. I also changed the road rendering to minimise drawing of the secondary road layer (at the expense of a small amount of additional computation elsewhere). I also found a dumb bug where, in certain situations, I was clearing parts of the screen that didn't need to be cleared.
I hope that provides a bit more background info!