Indie game storeFree gamesFun gamesHorror games
Game developmentAssetsComics
SalesBundles
Jobs
Tags
(+1)

OK, I now have another program I want to share. Or, kind of two. But one uses the other, so... The combination is one program? Maybe? I dunno.

This program also works in the default mode, and is primarily meant as a demonstration of a pretty old technique for handling code that won't fit in RAM. (I was introduced to it back in my DOS days, using Turbo Pascal... 5 I think? or 6?)

As such, this demonstration program basically goes the other way than my previous one, being far larger - in fact, as currently written, it fills most of the disk drive, taking up 240 addresses. Many of those are due to NILLISTs, though, which I included primarily because they made the development work easier. (More details below.)

The technique is known as overlays, and essentially entails loading only a part of the program at any one time, and then when a different part is needed, that part is loaded instead, on top of the other code, thus overlaying it - hence the name.

There's essentially two main parts of this program - one is the overlay loader, which stays resident in RAM at all times, and can be used by any program to load a chunk of code from the disk - while the other part is the demonstration program that uses overlays.

The overlay loader

In my implementation, the bootloader is also the overlay loader, since the task is essentially the same - in fact, the exact same code is used for the initial bootloading as for the overlay loading. Thus, my overlay loader is in ROM, not on disk - and isn't counted in the size of the demonstration program. This bootloader does still work for programs that don't use overlays, like the original win program.

Each overlay on disk is expected to have a one-word header, which contains the last address to be loaded. This works for non-overlay programs being  bootloaded because they start at address 0, which makes the number of words to load be the same as the last address to load.

A few things worth noting:

  • The loader initializes register 15 to 1, and expects you to never change it.
  • The loader uses register 14 internally, and expects you to never change it.
  • When called, the loader expects register 2 to be the disk address to load from.
  • The loader changes registers 1, 2 and 3 (and 14), but leaves the rest untouched.
  • Calling the loader is typically done with "JMP 3 9".
  • Each overlay can be at most 22 words long (so it doesn't overwrite the loader) in the game's default mode.
  • After loading an overlay, the loader always jumps to address 0.

Of course, all of these (except the last) can be ignored if you don't intend to use overlays. In that case, the loaded program can use 4 more words (26 total in default mode) before the loader has overwritten so much of itself that it can't load any more.

The main reason for using the suggested way of calling the loader is that it works even if the machine has been given more memory, as long as the loader itself has been adjusted accordingly (which you probably would want to do anyway to enable loading larger programs) - without needing to modify the program or overlay being loaded.

Here's the code for the {boot,overlay }loader:

SET 15 3 1 // R15=1 : initialize register for math use
MOVI 14 29 // R14=ram[29] : load the save-to-RAM instruction into a register
JMP 3 9 // jump to loader; R2 is already 0 which is where we should start
NILLIST 19
GETDATA 1 3 2 // R1=read(disk, R2) : read disk address of last instr. to load
MATH 1 3 5 // R3=R1 : set loop max register to that disk address
SET 14 3 0 // R14[address]=0 : initialize RAM address
MOVO 14 29 // ram[29]=R14 : update the save-to-RAM instruction; start of loop
MATH 15 14 0 // R14++ : increment RAM address
MATH 15 2 0 // R2++ : increment disk address
GETDATA 1 3 2 // R1=read(disk, R2) : read program data from disk
MOVO 1 0 // ram[0]=R1 : save loaded data to RAM; note: self-modified
IFJMP 2 5 3 // loop: jump 5 back if R2<R3
JMP 1 0 // start the loaded program

The demonstration program

The demonstration program is another implementation of the win screen, this time by using hard-coded line drawing operations rather than storing the screen as an image and loading that.

It does this without modifying itself at all, other than by loading overlays - the disk doesn't contain even a single MOVO instruction, nor any disk read/write instructions. (It could, it just doesn't need to, so it doesn't.)

There are two library procedure overlays, one for drawing horizontal lines, and one for drawing vertical lines, with parameters for where to draw that line and with what color (using registers to pass the arguments).

The program starts by calling horz-line twice for the top and bottom borders, then vert-line twice for the left and right borders, then calls horz-line in a loop to fill the background. (Thus showing overlays can be reused.)

Then, for the characters, it has a simplified copy of the vert-line procedure, which it calls 10 times for the various lines of the characters, with a few explicitly set pixels here and there.

Thus, in total it ends up drawing 20 lines plus setting 3 pixels, all done with actual code (instead of looping though a list of lines to draw, as then it would just be another image file viewer, just for a vector image instead of a raster image).


Obviously, all this code doesn't fit in 32 words of RAM, nevermind the 26 that are the most my bootloader can load from disk.

So, instead, I split it up into 10 overlays that are each 22 words or less (since that's the most my loader can load without overwriting itself in the default mode).

The exact length of each overlay varies, and some of them take advantage of this by first doing some initialization and then loading a smaller overlay on top of that initialization code, which runs and then resumes the code of the first overlay without having to reload it.

Of course, having to stop to load the next bit of code means that this runs slower (with pauses) than if everything was in RAM, but in return it's not as limited by RAM size. (Also, this demo program could still be optimized a bit.)

For ease of development (not having to keep modifying the addresses everywhere if I change the size of an overlay), the disk is split up into chunks of 24 words, that each contain one overlay. This "wastes" some space between the overlays, which is filled with NILs, but made things easier on me, so I considered it worth the tradeoff. (Of course, if I had needed more than 10 chunks, I would have run out of space. Luckily, I didn't.)

I actually wrote each overlay in a separate file, and then copied the code into a combined file with the disk formatting (and editing in the actual disk addresses), but since I did have to go back and edit things sometimes, having the chunks was nice. The combined file is the most useful one, so it's the one I've included here.

Here's the code for the demonstration program:

// main 0: draw top and bottom border
// start address: 0 = 00000000000000000000000000000000
// length: 16
DATAC 00000000000000000000000000010000 // last address: 16
JMP 0 4
DATAC 11000000000000000000000000000000 // Start pixel top-left
DATAC 00000000000000000000000000010000 // End X position
DATAC 00000000000000000000000000011000 // Disk address of hline proc
MOVI 10 1 // R10=ram[1] : load start pixel top-left
MOVI 11 2 // R11=ram[2] : load end X position
MOVI 2 3 // R2=ram[3] : load disk address of hline proc
JMP 3 9 // Jump to overlay loader : load and run the hline proc
MATH 10 2 5 // R2=R10 : previous start pixel
MOVI 10 14 // R10=ram[14] : load start pixel bottom-left
MATH 10 3 5 // R3=R10 : new start pixel
IFJMP 1 0 1 // if R2!=R3 (old start pixel is not same as new) then run again
MOVI 2 15 // R2=ram[15] : load disk address of next part of main
JMP 3 9 // Jump to overlay loader : load and run the next part of main
DATAC 11000011100000000000000000000000 // Start pixel bottom-left
DATAC 00000000000000000000000001001000 // Address of next part of main
// last address: 16
NILLIST 7 // alignment for the next overlay part
//
// subproc: draw horizontal line
// inputs:
//   R10: start pixel and color
//   R11: end X position (not inclusive - will not be drawn)
// overwrites: R1, R2, R3
// start address: 24 = 00000000000000000000000000011000
// length: 8
DATAC 00000000000000000000000000100000 // last address: 32
MATH 11 3 5 // R3=R11 : copy end X position
MATH 10 1 5 // R1=R10 : copy start pixel pos/color
MATH 2 2 1 // R2=0 : clear R2 so start pos will be the only thing in it
PMOV 1 2 2 5 26 1 // R2=R1[posX] : copy start position
SETDATA 0 3 1 // write current pixel to screen
MATH 15 2 0 // R2++ : increment X
PMOV 2 1 28 31 26 0 // R1[posX]=R2 : copy next X position
IFJMP 2 3 3 // if R2<R3 (curX < endX) then continue loop
// last address: 32
NILLIST 15 // alignment for next overlay part
//
// subproc: draw vertical line
// inputs:
//   R10: start pixel and color
//   R11: end Y position (not inclusive - will not be drawn)
// overwrites: R1, R2, R3
// start address: 48 = 00000000000000000000000000110000
// length: 8
DATAC 00000000000000000000000000111000 // last address: 56
MATH 11 3 5 // R3=R11 : copy end Y position
MATH 10 1 5 // R1=R10 : copy start pixel pos/color
MATH 2 2 1 // R2=0 : clear R2 so start pos will be the only thing in it
PMOV 1 2 6 8 23 1 // R2[posY]=R1[posY] : copy start position
SETDATA 0 3 1 // write current pixel to screen
MATH 15 2 0 // R2++ : increment Y
PMOV 2 1 29 31 23 0 // R1[posY]=R2[posY] : copy next Y position
IFJMP 2 3 3 // if R2<R3 (curY < endY) then continue loop
// last address: 56
NILLIST 15 // alignment for next overlay part
//
// main 1: draw left and right border
// start address: 72 = 00000000000000000000000001001000
// length: 16
DATAC 00000000000000000000000001011000 // last address: 88
JMP 0 4
DATAC 11000000100000000000000000000000 // Start pixel top-left
DATAC 00000000000000000000000000000111 // End Y position
DATAC 00000000000000000000000000110000 // Disk address of vline proc
MOVI 10 1 // R10=ram[1] : load start pixel top-left
MOVI 11 2 // R11=ram[2] : load end Y position
MOVI 2 3 // R2=ram[3] : load disk address of vline proc
JMP 3 9 // Jump to overlay loader : load and run the vline proc
MATH 10 2 5 // R2=R10 : previous start pixel
MOVI 10 14 // R10=ram[14] : load start pixel top-right
MATH 10 3 5 // R3=R10 : new start pixel
IFJMP 1 0 1 // if R2!=R3 (old start pixel is not same as new) then run again
MOVI 2 15 // R2=ram[15] : load disk address of next part of main
JMP 3 9 // Jump to overlay loader : load and run the next part of main
DATAC 11111100100000000000000000000000 // Start pixel top-right
DATAC 00000000000000000000000001100000 // Address of next part of main
// last address: 88
NILLIST 7 // alignment for next overlay part
//
// main 2: draw background
// start address: 96 = 00000000000000000000000001100000
// overwrites: R1, R2, R3, R10, R11, R12
// length: 20
DATAC 00000000000000000000000001110100 // last address: 116
JMP 0 4
DATAC 10000100100000000000000000000000 // Start pixel top-left
DATAC 00000000000000000000000000001111 // End X position
DATAC 00000000000000000000000000011000 // Disk address of hline proc
MOVI 10 1 // R10=ram[1] : load start pixel top-left
MOVI 11 2 // R11=ram[2] : load end X position
MOVI 2 3 // R2=ram[3] : load disk address of hline proc
JMP 0 8 // jump to the rest of the initialization
MATH 15 12 0 // R12++ : increment Y position
PMOV 12 10 29 31 23 0 // R10[posY]=R12[posY] : copy new Y position
MATH 12 2 5 // R2=R12 : copy Y position for comparison
MOVI 3 18 // R3=ram[18] : load end Y position
IFJMP 1 0 3 // if P2<P3 (curY < maxY) then continue loop
MOVI 2 19 // R2=ram[19] : load disk address of next part of main
JMP 3 9 // Jump to overlay loader : load and run the next part of main
MATH 12 12 1 // R12=0 : clear value for insertion of Y position bits
PMOV 10 12 6 8 23 1 // R12[posY]=R10[posY] : copy start Y position
JMP 3 9 // Jump to overlay loader : load and run the hline proc
DATAC 00000000000000000000000000000111 // End Y position
DATAC 00000000000000000000000001111000 // Address of next part of main
// last address: 116
NILLIST 3 // alignment for next overlay part
//
// main 3: draw W, part 1 of 2
// start address: 120 = 00000000000000000000000001111000
// overwrites: R1, R2, R3
// length: 22
DATAC 00000000000000000000000010001110 // last address: 142
MOVI 1 12 // R1=ram[12] : copy start pixel pos/color
MOVI 3 13 // R3=ram[13] : copy end Y position
JMP 1 15 // jump to drawing routine
MOVI 3 13 // R3=ram[13] : copy end Y position
IFJMP 0 4 1 // if R2!=R3 then second line is done already
SET 1 0 01001010 // R1[0]=.. : increment X position, keep rest
SET 3 3 00000111 // R3[3]=7 : update end Y position
JMP 1 15 // jump to drawing routine
SETDATA 0 0 0100110110000000000000 // draw top pixel for center of W
SETDATA 0 0 0100111000000000000000 // draw bottom pixel for center of W
MOVI 2 14 // R2=ram[14] : load disk address of next part of main
JMP 3 9 // Jump to overlay loader : load and run the next part of main
DATAC 01000100100000000000000000000000 // Start pixel, top-left
DATAC 00000000000000000000000000000100 // end Y position, middle of W
DATAC 00000000000000000000000010010000 // Address of next part of main
MATH 2 2 1 // R2=0 : clear R2 so start pos will be the only thing in it
PMOV 1 2 6 8 23 1 // R2[posY]=R1[posY] : copy start position
SETDATA 0 3 1 // write current pixel to screen
MATH 15 2 0 // R2++ : increment Y
PMOV 2 1 29 31 23 0 // R1[posY]=R2[posY] : copy next Y position
IFJMP 2 3 3 // if R2<R3 (curY < endY) then continue loop
JMP 1 3 // jump back to resume drawing commands with next line
// last address: 142
NILLIST 1 // alignment for next overlay part
//
// main 4: draw W, part 2 of 2
// note: uses code from the previous part
// start address: 144 = 00000000000000000000000010010000
// overwrites: R1, R2, R3
// length: 13
DATAC 00000000000000000000000010011101 // last address: 157
MOVI 1 10 // R1=ram[10] : copy start pixel pos/color
MOVI 3 11 // R3=ram[11] : copy end Y position
JMP 1 15 // jump to drawing routine
MOVI 3 11 // R3=ram[11] : copy end Y position
IFJMP 0 4 1 // if R2!=R3 then second line is done already
SET 1 0 01010010 // R1[0]=.. : decrement X position, keep rest
SET 3 3 00000111 // R3[3]=7 : update end Y position
JMP 1 15 // jump to drawing routine
MOVI 2 12 // R2=ram[12] : load disk address of next part of main
JMP 3 9 // Jump to overlay loader : load and run the next part of main
DATAC 01010100100000000000000000000000 // Start pixel, top-right
DATAC 00000000000000000000000000000100 // end Y position, middle of W
DATAC 00000000000000000000000010101000 // Address of next part of main
// last address: 157
NILLIST 10 // alignment for next overlay part
//
// main 5: draw I, and draw N part 1 (first line)
// note: uses code from a previous part
// start address: 168 = 00000000000000000000000010101000
// length: 15
DATAC 00000000000000000000000010110111 // last address: 183
MOVI 1 11 // R1=ram[11] : copy start pixel pos/color for I
MOVI 3 13 // R3=ram[13] : copy end Y position
JMP 1 15 // jump to drawing routine
MATH 1 2 5 // R2=R1 : copy current pixel pos/color
MOVI 3 12 // R3=ram[12] : copy start pixel pos/color for N
IFJMP 0 4 2 // if R2>R3 (cur pix > N start) then second line is done already
MATH 3 1 5 // R1=R3 : copy start pixel pos/color for N
MOVI 3 13 // R3=ram[13] : copy end Y position
JMP 1 15 // jump to drawing routine
MOVI 2 14 // R2=ram[14] : load disk address of next part of main
JMP 3 9 // Jump to overlay loader : load and run the next part of main
DATAC 01011100100000000000000000000000 // Start pixel for I
DATAC 01100100100000000000000000000000 // Start pixel for N, top-left
DATAC 00000000000000000000000000000111 // end Y position
DATAC 00000000000000000000000011000000 // Address of next part of main
// last address: 183
NILLIST 8 // alignment for next overlay part
//
// main 6: draw N part 2 (diagonal)
// note: uses code from a previous part
// start address: 192 = 00000000000000000000000011000000
// length: 13
DATAC 00000000000000000000000011001101 // last address: 205
MOVI 1 10 // R1=ram[10] : copy start pixel pos/color
MOVI 3 11 // R3=ram[11] : copy end Y position
JMP 1 15 // jump to drawing routine
MOVI 3 11 // R3=ram[11] : copy end Y position
IFJMP 0 4 1 // if R2!=R3 then second line is done already
SET 1 0 01101110 // R1[0]=.. : increment X position, keep rest
SET 3 3 00000110 // R3[3]=6 : update end Y position
JMP 1 15 // jump to drawing routine
MOVI 2 12 // R2=ram[12] : load disk address of next part of main
JMP 3 9 // Jump to overlay loader : load and run the next part of main
DATAC 01101001000000000000000000000000 // Start pixel, part 1 of diagonal
DATAC 00000000000000000000000000000100 // end Y position, middle of N
DATAC 00000000000000000000000011011000 // Address of next part of main
// last address: 205
NILLIST 10 // alignment for next overlay part
//
// main 7: draw N part 3 (last line), and draw exclamation mark, and halt
// note: uses code from a previous part
// start address: 216 = 00000000000000000000000011011000
// length: 13
DATAC 00000000000000000000000011100101 // last address: 229
MOVI 1 10 // R1=ram[10] : copy start pixel pos/color for N line
MOVI 3 11 // R3=ram[11] : copy end Y position
JMP 1 15 // jump to drawing routine
MOVI 3 11 // R3=ram[11] : copy end Y position
IFJMP 0 4 1 // if R2!=R3 then second line is done already
MOVI 1 12 // R1=ram[12] : copy start pixel pos/color for excl. mark
SET 3 3 00000101 // R3[3]=5 : set end Y position
JMP 1 15 // jump to drawing routine
SETDATA 0 0 0111101100000000000000 // draw bottom point for excl. mark
HLT // We're done drawing the win screen, so halt here
DATAC 01110000100000000000000000000000 // Start pixel, top of N line
DATAC 00000000000000000000000000000111 // end Y position, bottom of N
DATAC 01111000100000000000000000000000 // Start pixel, top of excl. mark
// last address: 229
NILLIST 10 // alignment for next overlay part
(1 edit)

I'd say you shared a program, and the compiler/runtime to make it run.  Very fancy!

I've thought about writing something similar to this independent of having ever actually seen something like it in action before, but it makes sense that the idea's been around for a while.

If you wanted to be thrifty with your space usage, I'd say to make the start of the drive a list of starting addresses for each function, and then make the actual first address of said function contain the length of it/the ending address.  That would (in theory) cut down on space usage, especially when you're not dealing with too many small functions where the 8-byte overhead could take up a large part of the overall space requirements.

(Also: I'm super happy to see flag 3 for JMP in use.  I knew it would come in handy someday!)

(EDIT: OMG, I'm blind, you already had the end address at the start of each function!)

Seeing the actual code for your testing win screen (and the associated walls of addresses) frankly makes me want to work on some kind of custom out-of-game code editor.  Having a tool where you could write functions & such in their own screens, easily reference other functions (even pass arguments to them!), etc, would be SUPER cool, and probably make life a ton easier for high-level programmers who can't get their heads out of OOP-land (like me >.<)

Very impressive work, in any case!

(+1)

Thank you. :)

Yeah, the end address (or length, but end address was easier for the code to deal with) pretty much had to be in there somewhere, so I could load only the necessary words from disk (both for speed and to allow partial overwrites with some code remaining behind). Putting it at the start of each function means the loader only has one address it needs to deal with to start loading that code.

If I was really strapped for disk space I suppose it might be possible to put the length as the first byte of the given disk address or something, but that would both limit the max address I could load from, and make the loader code more complicated. In turn, more complicated loader code would mean the loader was larger, and thus would leave less RAM space for each overlay, which means not only that they would be trickier to write, but that you'd probably need more of them - so did we actually save any space doing it..?

Putting the start addresses in a table at the start of the disk is a nice idea though; while it does take an extra word, it would allow reference to functions by index into that table instead of needing to reference it by address directly, which would add a level of indirection so that the calling code doesn't need to change even if the function has to be moved.

Something I've been thinking of is to write a script that takes a bunch of code files (like I had here) and merges them together into a "disk image" code file for me (as in, adds in the length word and such for each of them while concatenating them).

That would simplify my work as a programmer, since I wouldn't have to do that merging manually, and it would save disk space since it wouldn't need to insert all those NILs to make tracking the addresses easier (computers are good at such tracking - generally much better than humans).

However, it would also have to insert the resulting addresses into the code in the right places for me, since I no longer know where exactly on disk each file will end up. So I was thinking I'd have to make up some syntax for that. (Your idea about the function table might make that unnecessary, though, as long as I can predetermine which function gets which index in the table.)

Essentially, I'd be adding a preprocessor (like C has). And if I do get around to making it, there are some other features I've been thinking of too, like having labels for addresses (both disk and RAM), so thaht it doesn't matter if I move the DATACs around as long as they're labeled (and as long as I used that label in the relevant instructions).

On the other hand, something else I've been thinking is that if I start down that road, maybe I should I just go all the way immediately, and instead of writing my own compiler, just make a new target for an existing one like gcc or LLVM. That way we could write in C/C++ or other high-level languages and have it compiled down to the TC-06 for us. Though that assumes that the compiler can cope with the extremely restricted nature of the TC-06 - especially the default mode would probably be difficult, with its tiny amount of RAM limiting the code space. It also assumes that I could figure out how to do that - it's often easier to make your own code base than to figure out someone else's after all...

Regarding the out-of-game code editor, like I mentioned, I've been writing most of my programs outside of the game and just pasting them in (*thank you* for making copy/paste work). It's just been easier that way, in part due to higher familiarity with the editor.

Having a purpose-built IDE could be neat, but I generally think it would be better to build more general tools that can be used with existing IDEs first.

Take, for example, those function references you mentioned - if I understand you correctly, they would require some extensions to the language (if not a whole new language), so there would need to be a compiler of some sort to turn those into the actual assembly code used by the game.

I think starting with that compiler would be better than starting with a new IDE to host it, in large part because the compiler would be the new part - while pretty good IDEs already exist, that people are familiar with, and which could be used with such a compiler. (It would also enable non-IDE building, e.g. for larger projects using make.)

Of course, if what you _want_ to do is to build an IDE, don't let me stop you. If people only ever did the "best" thing, the world would be a boring place. Which, honestly, makes it arguable whether it was really the best thing to do in the first place... (Not to mention, best according to whom?)

(Also: I probably wouldn't have missed it if you hadn't added it, since I wouldn't have thought of it, but yes, JMP 3 did come in handy... :) )

(1 edit)

Using the indexed start-of-disk system would be really handy for writing functions in a more efficient manner, and for making a program that can easily generate an Overlay-ready file with functions inside of it - being able to consistently refer to a function by ID sounds like an absolute lifesaver!

My idea for the IDE would, at first, basically just be a preprocessor.  All code is still written purely in TC-06 assembly, but you have some special "codes" the IDE knows how to use (i.e, "FUNC <function_name>" automatically inserts the code to jump to a function - the names are for human purposes only, it'll automatically pick an integer id for 'em at generation time), and it lets you work on individual functions easier.  That alone would probably make my life way easier when making games, OSes, etc.

Actually building up to a proper C/C++/C#/Java/etc compiler/parser of some sort that can output TC-06 assembly code would be INCREDIBLY COOL and definitely stands as a goal for sometime in the future, but I do have a feeling it's going to require some additions to the TC-06 language itself.  There's space for 5 more op-codes (plus, those codes could end up being some sort of "sub-code" thing - e.g, SUBONE [4-bit op-code from the sub-code #1 list] [24 bits of arguments] could be a new "command" to allow 15 more commands with less argument data), which I should hope would be enough!

I'm definitely gonna poke around in QT and see what tools I have available for making an IDE/preprocessor, and if I can wrap my brain around it, I definitely wanna try making a basic test of the preprocessor tool, because I really do think having some basic built-in function/memory location management could be a lifesaver while programming bigger things.

Unrelated-ish, but I'm actually kinda tempted now to run some sort of Senbir Game Jam where the goal is just to make a game that runs in Senbir, in, say, a week.  Not many people are likely to participate, I suspect, and there are some limitations in Senbir itself at the moment (like the apparent lack of support for holding down a key?), but I know I for one would have fun trying (and probably failing) to make a proper game in it.  The one major caveat I see there is the fact that the (apparent) maximum clock speed of 180 Hz is definitely a major limiting factor that could mess up people's entries, and having to deal with the gameified render layer could annoy potential players of said games.

EDIT: Also, possible thing of interest - I've set up an official Senbir Discord server!  Link/join widget is in the main Itch page.

I think the start-of-disk function index system will mostly be useful for when you're doing it manually and referring to the functions by ID yourself.

If your assembler/compiler/preprocessor/IDE (or whatever you want to call it) lets you use "FUNC <name>" and inserts the code to jump there on its own, keeping track of the IDs for you - then it must also keep track of the addresses for those IDs, which suggests it can probably just as easily use those addresses directly, without needing the IDs or such an index table.

Though, I suppose we would want to minimize the number of instructions (or other resources) needed to do a call, to maximize what's available for other code, so I guess havin the table might help with that.

Note: the rest of this post ended up being a rather in-depth analysis of the available approaches that such an assembler/etc. could use. Feel free to correct me if I've missed something, but apparently there's no cut-and-dried "best" approach, and it depends on circumstances... Also, I originally intended to respond to other parts of your post as well, but this is too long already, so I'll add separate replies for those later.

Let's see now. I don't think the JMP to the loader routine can be avoided (unless you can put the call at the end of the memory area), so that's one instruction, but since it's unavoidable we can ignore it for the purpose of comparing the different approaches.

So the main thing we need is to somehow get the actual disk address to load the function from into a register for the loader to use.

With index table

Now, if we have and use an index table somewhere on disk, what would the code for that look like?

Well, to start with, it means we must have a GETDATA instruction to load the actual address from the index table, based on the index.

GETDATA supports an immediate (constant) value, so can we use that for the index, to load the address directly, with just one instruction?

*checks the docs* ... well, that constant value gets shifted up by 10 bits (documentation says 11, but I think that's incorrect, and 11 would be even worse for this). This means that it can only be used to load from addresses that are multiples of 1024. Which means we get one function per 4KiB of disk space, minus one for address 0 since that's where the initially bootloaded program is. Not really useful with the default 256-word disk.

So, we can't practically have the table index inside that one instruction, which means it must be stored either in memory or in a register.

The memory variant means having the function index stored in memory somewhere, probably loaded as part of the current function, and having the GETDATA instruction point to that address. The obvious way to do this would leave each call site using two memory addresses.

Trying to be clever and putting the instruction in the loader doesn't really help either, since we'll either be limited to one call per function, or have to spend instructions on modifying memory (either the instruction or the address with the index), which will probably take more addresses total.

The register variant means getting the function index into a register somehow, which will (generally) require at least one instruction, but in return it's possible to move the GETDATA instruction to the loader.

If we dedicate a register to it and limit ourselves to 256 functions (minus the initially bootloaded program), we can use SET to have it use just one instruction/memory address per call site.

If we don't dedicate a register to it, or need more functions, we have to either store the function index in memory somewhere (to use MOVI), or use multiple instructions (e.g. MATH to clear + SET) - so at least two addresses per call site. This is no better than the memory variant, so we can disregard this option.

So, with an index table, we end up with these options to choose from:
- two memory addresses per call site
- one memory address per call site, plus one in the loader and one register, with limitations

This is in addition to the space taken by the table itself, and the JMP for each call site.

Without index table

Now, if we don't have an index table, and just use the address directly, what can we do then?

Using MOVI will work, and will take two memory addresses - one for the instruction, one for the function address. Simple and straightforward.

Using SET can work, and will take just one instruction, but requires all functions to start at addresses before 256 (since we can only use 8 bits), and that we're careful about how the load-address register is used (since the other 3 bytes must remain 0).

However, the loader is currently using R2 for the load address (to use IFJMP), which I doubt we can be that careful with, so we'll need to use another one, and copy it into R2 as needed. That copy instruction can be in the loader, but this will essentially require a dedicated register.

It's also possible to use multiple SET instructions, to ease (or even eliminate) the limitations, but that means using between two and four instructions per call site, without being any better than using MOVI, so we can disregard this option.

Using PMOV or MATH essentially requires already having the address (or something that can easily be turned into the address) in a register, which seems rather unlikely in the general case, without spending extra instructions on it. Generating a function index using this method is somewhat more likely, but not much. So either way, this can only be used for special cases.

So, without an index table, we end up with these options to choose from:
- two memory addresses per call site
- one memory address per call site, plus one in the loader and one register, with limitations

This is in addition to the JMP for each call site, but avoids taking space for a table.

These options are very similar to the ones with a table, though not quite the same, in that the limitations of the second option are a bit different.

Summary of options

The first option for both uses the same number of instructions, addresses and clock cycles, which suggests that with those approaches, the table is an unnecessary indirection that just takes extra disk space.

The second option for both also uses the same number of instructions, addresses and clock cycles. But they have different limitations.

Now, if the disk is no larger than 256 addresses (like the default disk), then the limitations are void for both, since they are based on needing more than 8 bits to address things. So in this case the table again just takes extra disk space.

However, if the disk is larger than that, then the limitations come into play, and then the table is actually useful as it allows us to have functions stored anywhere on the disk instead of just in the first 256 addresses. However, even with the table, we are limited to at most 256 functions, minus the space taken by the initial program.

Which approach to choose

The above suggests that the choice of approach will mainly be based on which resources are most dear, and how many function calls you have.

First off, if you need more than 256 functions, you obviously need to use the first option, and thus don't need the function table.

If you need all the registers you can get, but have room to spare in memory, then again the first option is probably best.

If you can spare a register, but have lots of calls and little room in memory, then the second option is probably best, as it saves you one word of memory per function call.

If you have little disk space, and at least a couple calls, then the second option without a table is probably best, as it saves you both the table and one word per function call (not just in memory, but on disk).

If your dearest resource is your sanity (because you don't have an assembler/etc. that does this work for you), then I think the second option with a table is probably best, closely followed by the first option with a table, with the non-table options pretty far behind...

Aaargh. Of course. Just moments after posting, and I realized I missed something rather significant... (The worst part is that I'm pretty sure I had noticed it early on and started to write something about it, but it apparently got lost somewhere while editing/expanding, and then I forgot about it...)

For both of the with-table options, I missed/forgot that GETDATA always puts the result in R1, while the loader expects to get the load address in R2, so it can use it with IFJMP without having to move it there first.

This means that those options require another instruction, to move the address from R1 to R2, either per call site or in the loader itself.

Thus, the with-table options are always more expensive (in both memory size and clock cycles) than the non-table options, instead of being equivalent.

... I do think that most of the post is still valid, though, including the last section.

PSA: There is an edit button, so in the future, if you post something and go "aw crap, I should've remembered to add [insert thing here]!" - you have a way out without double-posting.  I know I've done that before - it happens to the best of us!  At least this isn't Twitter or something, right?  :P

I am inclined to agree with pretty much all your conclusions on efficiency analysis - the tables are pretty much only good for human sanity when developing without an IDE or similar.  Having to search & replace for address changes is an absolute pain.  I don't have much in the way of comments beyond that - brain is a bit numb from it being late where I am, and from programming a kernel for the past several hours.  Will re-read everything in the morning when I'm less zombified.  >.<

(+1)

Yeah, I know there's an edit button, but at the time I really didn't feel like doing the editing required, since it wasn't just a matter of "should have added this" - to edit in the correction I would have had to change and rewrite significant chunks of the text, such as the summary, and it was already late (getting towards early), and it would have taken enough time that others might have read the first version in the meantime, so I'd have to add something about the correction anyway, and... yeah.

I suppose I could have added my reply as an "EDIT:" at the bottom or something, but it felt more appropriate as a reply, since it was a pretty significant correction. (And in my mind, double-posting is something entirely different than responding to yourself. But I guess I might be out of touch on that.)

(+1)

(Note: This is based on the version that was out before the update mentioned in the below post about the OS kernel. I don't know if that update changes anything that matters for this post, but thought I'd mention it just in case.)

Regarding the op-codes, one thing I've noticed is that the current instruction set is fairly RISC-like. It even mostly uses a load/store architecture, with MOVI/MOVO being the only memory-accessing instructions - except for GETDATA/SETDATA with flag 1 or 2. That's pretty neat.

Also, several of the existing instructions already have the kind of sub-code you mentioned, though usually called "flag", and their bit positions vary.

E.g. MATH has sub-codes for ADD, SUB, MOV, etc., and IFJMP has two separate ones for forward/back and for EQ, GT, etc..

Regarding the max clock speed, based on the numbers you're using, I'm guessing your clock ticks are currently tied directly to the game's rendering frame rate, or maybe to some kind of timer (with a limited resolution).

However, I don't think you really need them to be tied that way - after all, for an emulator, it doesn't really matter if the ticks are evenly spaced. The most you might need is for them to appear that way to the player.

So, a fairly simple technique for that is to run two or more ticks per frame (or per timer triggering). This would double (or more) the apparent clock frequency, as long as the CPU can actually keep up with the emulation.

I used that technique in my JS TC-06 emulator, and have managed to hit 500kHz that way. Admittedly only with the debugger off, and a very simple program (the max I've reached depends on what it's doing) - but this was in JavaScript, running in Firefox, with code that isn't primarily built for max speed - so I suspect you can do better than that in C#, without needing to move to C++ or using things like threads.

Of course, that assumes you actually are timer bound and not CPU bound, but I don't really see how you could be CPU bound at that frequency, without trying to be...

(1 edit)

RedCode and RISC/CISC (as described by my dad, a C64 vet.  can never remember which, though >.<) have been my main inspirations for developing the TC-06 process set.  While many of the functions already support sub-codes of their own, the main purpose of adding a dedicated sub-code function would be for utilities, mainly - which is what the current UTL code does.  UTL 0 is the offset function used in the kernel, and the assembler will just directly accept OFST [starting address] to output the equivalent of UTL 0 [address].

The max clock speed seems to be a limitation specifically of the Unity function I'm using to execute the actual function that does a clock cycle, MonoBehaviour.InvokeRepeating(string methodName, float time, float repeatRate).  If I could get past that, I have a feeling the clock speed limitation would disappear.  Your idea about having it run multiple ticks per call of the function from InvokeRepeating is a good one - it wouldn't be hard to amp it up to at least 1.8kHz, maybe further!  Unity can be a bit badly-optimized, in my experience, so even a 1kHz max would be a big deal to me - getting up to, say, 1MHz would be huge (and I'm genuinely not sure it's possible without using an external library.)  I would LIKE to keep it trying to run at a specific clock speed (e.g, keep a timer, rather than just "execute as many cycles per frame as possible"), because the limitations in processor speed are a fun part of the challenge for TC-06 development.

You made a JS implementation of the TC-06?  That's mind-blowingly cool!  Do you plan to share it anywhere?  Also, 500kHz really isn't half-bad, especially when you compare it to the current max speed of ~1/2778th that in Unity - I'm not sure that 500kHz could be topped in Unity with just raw C# stuff, but handling it with a native C++ library or similar should pretty much guarantee much higher speeds.

(+1)

Yeah, I did - well, it's not complete, and probably has bugs in it (as in, things that don't work quite the same as in yours), and parts that could use optimization, but it seems to work, more or less.

I'm not sure why I made it, really, but couldn't seem to stop myself... Probably mostly the challenge, to see if I could, without looking (at least too much) at how you did it, just at the docs and the in-game behaviour. (I haven't really looked through most of your code yet, just some of the latest diffs to see what exactly had changed - and I haven't yet integrated those changes into mine.)

I haven't shared it anywhere yet (wasn't sure you'd approve and figured I'd ask first), but I can push it somewhere (probably GitHub) if you'd like to see it (and don't mind it becoming public).

I agree that it's good to try to hit a specific frequency rather than simply running it as fast as it can - that's what my emulator does, too. I don't usually run it as fast as it can go, unless I'm specifically trying to find the max frequency or compare the performance of different implementations. (The default frequency is set at 60Hz to match Senbir.)

Also, that 500kHz figure is pretty much the highest one I've managed so far - most of the time I don't get anywhere near that (not that I really try to, or check all that often). Especially enabling the memory debugger limits the max frequency to a lot less, but it also depends on exactly which instructions it's running, since those take varying amounts of real time (e.g. a simple MOVO vs updating the screen). It has also varied over time with changes to the implementation. I think I've seen something like 30kHz too, at the lower end.

Of course, if I had done just one emulator tick per timer tick, I wouldn't have gotten anywhere close - IIRC setInterval in browsers is usually limited to something like at least 10ms between ticks, which would have limited me to about 100Hz. I still set it to the desired frequency, so it doesn't tick more often than it needs too (e.g. 100Hz when I want 60Hz), but it also doesn't always tick as often as I'd want.

My code for getting around that is a bit odd, though, in that it tries to maintain the asked-for clock frequency even if the underlying timer is unreliable (as in, has significant jitter) - so it keeps track of the time for the last tick that was run, and when the timer triggers, it checks what the current time is and runs as many emulator ticks as necessary to catch up to what it should have run by that time (whether that's 0 or 100k).

This approach is a bit risky, since if the asked-for frequency is too high, it might take so much time to run those ticks that it keeps needing to run even more ticks the next time the timer triggers, which (if not stopped) would eventually make it seem to have locked up since it takes so long.

(My approach for measuring the max Hz is to start the emulator with a reasonably high frequency, make a note of actual time and tick count, wait 10 seconds, make another note of actual time and tick count, and stop it. Then the frequency is calculated as (endTicks - startTicks) / (endTime - startTime). If I hit the target frequency, I increase the target and try again, until I fail to hit it. That gives me an approximation of how fast it can go (with that program etc.) - I usually round that result downwards to avoid problems.)

I've considered adding code to detect and protect against that runaway effect, but haven't done so yet - which enables me to measure the max without having that protection kick in.

A similar risk exists even for a simple loop that gives you N emulator ticks per time tick, too, if you don't protect against setting N too high - but at least that wouldn't start seemingly safe and then run away getting worse. (As long as your timer doesn't try to catch up to missed ticks, anyway.)

(1 edit)

I know the feels - I'm prone to "doodling" in Unity and making little projects, just because they seem like an interesting challenge.

Senbir, and by extension, the TC-06 architecture, are LGPL-licensed.  You're 100% free to redistribute mods, updates, reimplementations, etc - I believe they just have to be LGPL themselves.  I'm not sure how it affects something based on the architecture (maybe I should explicitly define that as Creative Commons or something?  dunno), but far as I'm concerned/can tell, you're good to go for sharing the JS version.  You'd need to include a link (I think) to here, and/or the Gitlab page, but that should be the extent of extra garbage you need to do for sharing it.

I'd be interested in seeing the catch-up/timer code you've got - even if it's not perfect, it'd probably be useful in figuring out how to give Senbir better support for clockrates >180Hz.  My current ideas vaguely revolve around getting some sort of average (e.g, the InvokeRepeating speed is 1f/(clockSpeed/numberOfCyclesPerInvocation), and numberOfCyclesPerInvocation=clockSpeed//180f, assuming // is a "divide and round to integer" function of some variety), but I'm not quite sure what formula would be best, and it could still cap out around some relatively low number, like 324kHz.

EDIT: *fixing derp wording because sleepy*

(+1)

Well, regarding reimplementations, IANAL, but I think clones (whether of games or other things) are an example of how that doesn't require following the license of the original (as long as the clone authors haven't copied any of that original's code, just written equivalent code themselves based on observed behaviour (and maybe documentation?)). (Come to think of it, isn't Mono in some ways an example of that? Having reimplemented .NET?)

So, given that I wrote it without referencing the original code, I probably don't really have to to use the same license etc. for it. On the other hand, it's not like I'm really against using the LGPL (though I'm not very familiar with it, esp. v3, as I usually tend to use simpler ones) - and I have since started looking at the code of the original (not to directly copy it but to figure out what exactly it supports/handles (actually for a different project)), so I guess that might be going into a gray area, since it might affect whatever code I write later even if I'm not trying to copy it. So I guess it would probably be safer and easier to just use the LGPL like you said.

Anyway, here you go: https://github.com/edorfaus/TC-06.js/

You can see it live here: https://edorfaus.github.io/TC-06.js/

Regarding the architecture itself, I'm not sure if that can really be protected with a copyright license, as I think the copyright can protect your documentation about that architecture, but IIUC not the actual information inside that documentation about the architecture itself? (Meaning that if someone learned it well enough to write their own docs from scratch, without copying from yours, I think they could do that without infringing.) I think you might have to get into patents to protect the information itself. But again, IANAL, and this is a pretty messy legal area IIUC. Not to mention the international aspects of it. (I'm in Norway, myself.)

However, I think the idea about using CC for that instead of GPL is probably a good one regardless, as the GPL is not really meant for documentation like that (or documents in general, really). Do look into it yourself before deciding, though, don't take my word for it. (I'm not an expert.)

--

That formula pair looks about right to me (maybe depending on implementation details), though you may want to tinker with that 180 number - if a lower number is more reliable (in terms of the timer actually hitting the frequency), it might be better to do that and just loop a few more times for each timer tick instead. You may also want to add some limits to avoid having someone make it try to loop 1e9 times per timer tick or something crazy like that. (Ideally it would probably adjust that relative to what the host PC can do, but that's harder to get right.)

Here's the class that handles the ticking in my implementation: https://github.com/edorfaus/TC-06.js/blob/master/clock.js

That class emits a "tick" event to run an emulator tick, while the "render-tick" event was added later for optimization purposes (the memory debugger now uses that to avoid updating its HTML elements repeatedly in one render cycle).

(I've tried to cleanly separate the various components, to keep them independent of each other (loose coupling), with just the main index.js linking them together. I do think I have a few places left with some tighter coupling that I haven't gotten rid of yet though. Feel free to ask if something's confusing, though I can't promise to always (or ever) respond quickly.)

(1 edit)

I, also not a lawyer, concur with your thoughts on all the legal jargon.

Really solid JS implementation!  I've spent a bit of time toying around with it, trying to write a basic Blinkenlights program - I have to admit, having an assembler has spoiled me.  Before Senbir, I was toying with an older prototype of the TC-06 architecture, and made a Blinkenlights program with naught but my documentation, and manually writing out the binary code for it.  I'm having a bad time of trying to do the same in the JS version, and I'm not willing to pin that on, say, unfinished debug tools or what-have-you.  >.<

Classic Terminal
(Context for the surroundings: I just hijacked a previous test/doodle project that had a basic first-person controller, so it was possible to walk around and turn the computer on/off)
I've done some maths for the formula being /100 instead of /180, and the numbers come out to something like needing to run the actual clock cycle function ~10k times per call from InvokeRepeating at 100Hz to run the computer at 1MHz, and...My Unity experience tells me it'd probably just lock up if you asked it to crunch like that.  With a native C++ library, it might be able to handle it, though - even in a low-power VM, I can do some heavy crunching with minimal slowdown.  The XScreensaver Galaxy program can be modified to be a proper N-body simulation, without MUCH lag even when there's several galaxies with 1000+ bodies each, and still manage ~45 FPS - in a VM.

I think that's probably gonna be my next development project, just trying to optimize things so it's possible to run at even a meagre 1kHz.  Well, that, and toying with a networking peripheral.  I have an idea for the actual hardware side (i.e, what you have to work with as a programmer), but I'm not sure how to go about letting players actually connect to one another, from a UX standpoint.  Do you pick a peripheral in-game, go to the debug screen, type in a real-world IP, and connect to it?  Does it automatically host things, or do you have to be a host/connector?  Stuff like that.