Regarding MOVI/MOVO variants taking more clock cycles than MOVR, well, not necessarily, because they're more flexible and so can be used to do more parts of the job.
Consider the first version of your SET operation, with R0 pointing to the arguments - your version with MOVR takes 14 cycles, while with the MOVI/MOVO variants it could be done in 5, without self-modification (thus not needing OFST), and using one register less:
MOVIR 4 0 // R4 = mem[R0] : read the source address into reg 4 MOVIR 2 4 // R2 = mem[R4] : read the data from the source address MATH 15 0 0 // R0++ : increment argument pointer MOVIR 4 0 // R4 = mem[R0] : read the destination address into reg 4 MOVOR 2 4 // mem[R4] = R2 : write the data to the destination address
If the MOVI/MOVO variants supported both register and immediate at the same time, it could be done in 4, since then R0 doesn't need to be incremented.
For version 2, though, you're correct that it would take an extra cycle - though that assumes we're not counting the cycles required to move the pointers into those particular memory locations from wherever they were originally. That may or may not be reasonable, depending on the larger context.
(Btw, unlike MOVR, MOVIR/MOVOR could be used for your ADD operation as well.)
Of course, that flexibility also means they remove far more of the need for self-modifying code, so it depends on what you want the game to be - MOVR is somewhat less of a change in that respect.
Regarding taking up a sub-code slot, that's not in itself a problem, since there's really lots of space available - worst-case, you can not only do sub-sub-subcodes, but make multi-word instructions. (As in, the instruction at address N takes arguments from address N+1, and sets the next address to be executed to N+2 instead of N+1. Those arguments could even themselves be subcodes.) Consider e.g. the x86 family - it has 1-byte instructions, but from what I've read, it also has instructions that are up to 15 bytes long. (But then, it's got thousands of instructions - exactly how many depends on how you count them...)
Of course, an indirect problem with that is having to implement and support all those instructions... I'd say that would be a better reason to not add them without a decent reason for having them.
Regarding functions using registers or RAM, having many/large arguments, etc. - one thing to remember is that there's a major difference between language-level functions and the kind of compiler-operation-level functions I believe we were talking about here.
The language-level functions are the kind the programmer specifies in the higher-level language they're writing in (C, C#, whatever), with whatever parameters etc. they specify, and executing whatever code the programmer wrote - they're a part of the source code that is the input to the compiler.
At the CPU level, execution of these functions is generally started by a jump from whatever code called them, and once they're done executing, there's another jump to return to the calling code. Where to put the return address, arguments and return value (if any) for the function differs between architectures and languages and is part of what's generally termed the calling conventions of that architecture/language. Here, using RAM for arguments often makes a lot of sense (often in the form of a call stack).
The compiler-operation-level functions, on the other hand, are (I assume) used by the compiler to generate the assembly code that implements a given language-level function - and so are defined by the compiler itself (fixed parameters etc.), and basically return some assembly code that implements the operation that particular function is for. I'd say they are probably more accurately thought of as templates, rather than functions, since they don't directly execute the operation, but return the code for executing it (to be put into the generated program).
At the CPU level, in the program the compiler generated, those functions (or rather the blocks of code they returned) are generally reached without any jumps, and don't jump when they're done; instead execution simply falls through from one operation to the next, since the compiler placed copies of their code into the output, one after the other, in such a way that the combination does what the higher-level language described. (Doing it with function calls (jumps), while possible, would be much slower and would probably end up taking more space.)
This leads to a fairly obvious optimization, where if the code for one operation loads a value from memory into a register to do something with it, and the next operation also needs that same value (or the result of the first operation) to be in a register to do something with it, then it's rather wasteful to load it from RAM again, since it's already present in a register. (Whether that value came from an argument to the language-level function is rather immaterial at that point.) However, that does require some careful tracking of exactly which value is stored where. (This is the optimization I meant decent compilers typically do, as it's a relatively easy and effective one - though not quite as important for the TC-06 since its RAM is very fast.)
By the way, at the compiler-operation level, I'd guess there's probably no such thing as a struct - by that time, the compiler has probably already transformed operations on a struct into operations on either its component fields or the block of memory that holds it, depending on what is being done with it. At the CPU level, copying a struct (instead of just the pointer to one) generally means copying either its fields one by one, or (often quicker/easier) the entire block of memory that holds the struct (which can be done with a tight loop) - and the compiler knows this and generates the assembly code accordingly, based on the low-level operations it has available for the target architecture.
Regarding OFST and programs living at an offset without it - this can be done by pushing (some of) the responsibility into the program loader. (As an example, I know that GNU/Linux does something kinda like this, though the details are different of course.)
Basically, your kernel (or the OS around it) necessarily has some piece of code that loads a program from disk into memory to be executed, right? And that loader necessarily knows which offset into memory it is loading that program into. So, in addition to loading the program into memory, that loader could also update (parts of) the program's code to work at that offset, and tell it what that offset is so it can do whatever else it needs to do to adapt itself.
As a simple example, the loader could update every MOVI and MOVO instruction it loads by adding the offset to the address stored in that instruction - and then set a register, let's say R14, to the offset before starting execution of the program. That would make the MOVI and MOVO instructions already point at the right places, and if the program does any self-modification, it can use R14 to adjust how it does that - and since the MOVO instructions are already updated, it can easily save the other modified instructions in the right places.
Of course, a fancier loader could also update any other instructions that are known to contain absolute memory addresses, or the program file could contain a header that tells the loader to update specific other locations as well (e.g. DATACs) that it can't auto-detect - but I think updating MOVI/MOVO is enough for things to work.
Technically, R14 isn't needed in this scenario - the program could MOVI one of its MOVI/MOVO instructions and grab the offset from that - but it makes things easier, and the loader probably has the offset easily available anyway, so why not?
Technically in the other direction, I think that if the offset is given in a register (or PCSR 0 does what I think it does), then even just updating a single MOVO would be enough (could even be on a fixed address), but that would make things harder on the program. It's probably better to make the loader do more, to avoid having to duplicate that effort in every program.
Come to think of it, another solution would be to have an OS function at a known fixed absolute address, that basically does a relative MOVO - call it with a relative target address and return address, and the OS function adds the offset to those addresses before performing the MOVO operation and returning to the program. If given an equivalent MOVI-performing function, or some way to find its offset (whether R14 or PCSR 0), I think it can do everything it needs to. Though in a somewhat more complicated (and slower) manner from the program's point of view, so the other solution is probably nicer.
Regarding the MMU, I would suggest adding an MMU device (for a GETDATA/SETDATA port) rather than adding CPU instructions for its functions. Mainly because that makes it easier/more sensible for some game modes to not have it, or for custom modes to have different models that work differently or have different APIs, without the CPU's instruction set having to be different. (Though also because I see it conceptually as a separate device that happens to sit between the CPU and the RAM, even though most (but not all) CPUs these days apparently integrate it into the same silicon chip.)
Based on just a little bit of googling, it's apparently a pretty complex device, with enough possible options and variations (segmentation vs paging, page sizes, PTE levels, APIs for process switching, etc.), that I don't think we're likely to get it perfect first try - if there even is such a thing as perfect here. There are trade-offs involved. So being able to experiment with different variations with otherwise equivalent computers would probably be nice.
(One tutorial I found basically said that, like everything in programming, the only way to really understand it is to play with minimal examples - but that the minimal example for using that MMU is rather large because it involves making your own small OS... On the other hand, a tutorial for a different MMU was much smaller and simpler, so it apparently depends.)
It sounds to me like you rather like the original ISA, in part because of its limitations, and therefore don't really want to loosen those limitations up much (if at all) - but at the same time, want to be able to do more advanced stuff without having all of those limitations making it extra hard. Which suggests that having them be runtime-optional is probably the best solution. And implementing that as separate modes (or options for custom modes) sounds to me like a good way of doing that.
I may be biased in that assessment, though, as personally, I think that the limitations are part of the challenge, and that it probably wouldn't be quite as fun to do those early levels (or some of the other things I've done) without them, as it would be rather too easy. But at the same time, I agree that they can get rather annoying when trying to do more complex things later. And thus, that it would be nice to have options to enable choosing the appropriate difficulty for any given idea. (Also, I tend to go for providing options and flexibility in general, making frameworks/toolkits more than products, in part to avoid having to choose for others. This tends to cause a certain amount of overengineering on my part... Like designing my simulator so that the CPU's instructions can be replaced/reconfigured individually...)
(Also, I'm rather weird in several ways, and not really a people person - so I don't really think I should try to speak for players in general.)
Come to think of it, something like this could also be integrated as part of the main game - essentially having things be unlocked as the player goes along and reaches more difficult levels, kind of like achievements.
E.g. after finishing some basic screen manipulation levels, "Re-reading the documentation with your newfound experience, you find you now understand a part you didn't before - which shows that there's a way to change the resolution and color depth of the screen." and continue with some levels that use the higher settings.
Or when you start needing better instructions, "You suddenly notice that some pages in the manual had gotten stuck together. Carefully peeling them apart, you discover that there are some more instructions that you didn't know about before."
Or maybe, "You discover a small box tucked away at the back of the computer. It turns out to contain a memory upgrade and a better processor."
Or "Studying the internals of the computer, you notice there's a jumper marked "debug mode" on one side and "normal mode" on the other - currently set to debug mode. After changing its position, the computer starts to run much faster!"
Of course, that kind of depends on thinking up some more levels that would make for a decent progression...
Eh, no worries. This post is probably no better. *Looks at clock* ... ~8am... Ok... time to go to bed I guess... Just gonna post this first... *shakes head* (... ok, ready to post finally... *checks* 08:45... *sigh*)