Buffered rendering?

A topic by Timendus created Oct 02, 2021 Views: 228 Replies: 4
Submitted(+1)

I was thinking about writing a game with a couple of "layers" (in colour, so I can't use the "each plane is sort of a layer" trick). Only having XOR for sprite drawing is a little limiting for my purpose, so this evening I wrote a few rendering algorithms that operate on a buffer in memory (clear buffer, AND/OR sprite to buffer, render buffer to the screen, that sort of thing).

Now here's the thing: it's working just fine, but it's freakishly slow. Even on "ludicrous speed" :P Especially the sprite rendering to the buffer seems to be way too slow to fill the screen in a reasonable time (in hires mode especially).

Hence my question: Has anyone here ever tried to do buffered rendering? And if so, was that a success? Is my code just really silly and unoptimized, or is it just kinda outside of the reach of the platform?

Host, Submitted (3 edits)

Might work in some situations, but it gets pretty brutal.

Assuming a buffer with data packed in an appropriate representation, we can do a fast 64x32 pixel (lores) fullscreen draw by using a series of 8 sprite draws in 21 cycles:


If you fill the v-registers with zeroes and take advantage of chip-8's auto-incrementing behavior on save, you can clear the buffer in 21 cycles:


Bulk operations have to touch every byte, so the best case for inverting the pixels stored in the buffer is around 296 cycles, with aggressive loop unrolling:

Merging together multiple buffers is even nastier, since you'll need to manually ping-pong the index register from buffer to buffer. Making use of the flag registers (xo-chip has 16 of 'em) as a lookaside buffer could potentially help.

The main thing I've papered over here is the byte layout. Plotting a pixel at a given x/y position in the buffer is fairly complex for the above routines, since 16x16 sprites alternate columns each byte and the sprites themselves are laid out in another zigzag pattern. There are many arrangements possible that would help or hinder different parts of a sprite drawing routine, but I think the result will be rather complicated any way you slice it. Using 8x15 sprites instead of 16x16 could help, but then drawing will be somewhat slower. Everything's a tradeoff.
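To make the layout problem a bit more concrete, here is a rough Python model (not Octo, and not the routines from the post) of where a pixel lands in a 256-byte buffer made of eight 16x16 sprites. The two-bytes-per-row format of a 16x16 sprite is standard; the left-to-right, top-to-bottom tile order and the names `pixel_address` and `plot` are assumptions for illustration only, and the actual routines may order the tiles differently.

```python
# Model of a 64x32 lores frame stored as eight 16x16 sprites.
# ASSUMPTION: tiles are stored left-to-right, top-to-bottom; the routines
# discussed in the thread may use a different (zigzag) order.
SCREEN_W, SCREEN_H = 64, 32
TILE = 16
BYTES_PER_TILE = TILE * TILE // 8                 # 32 bytes per 16x16 sprite
TILES_PER_ROW = SCREEN_W // TILE                  # 4 tiles across
TILE_COUNT = TILES_PER_ROW * (SCREEN_H // TILE)   # 8 sprite draws per frame
BUFFER_SIZE = TILE_COUNT * BYTES_PER_TILE         # 256 bytes


def pixel_address(x, y):
    """Return (byte offset, bit mask) of pixel (x, y) inside the buffer."""
    tile = (y // TILE) * TILES_PER_ROW + (x // TILE)
    tx, ty = x % TILE, y % TILE
    # Each 16-pixel sprite row is two bytes: left half, then right half.
    offset = tile * BYTES_PER_TILE + ty * 2 + tx // 8
    mask = 0x80 >> (tx % 8)                       # leftmost pixel is the high bit
    return offset, mask


def plot(buffer, x, y):
    """Set one pixel in a mutable buffer (e.g. a bytearray)."""
    offset, mask = pixel_address(x, y)
    buffer[offset] |= mask
```

Even in this friendliest ordering, a plot needs two divisions, two remainders and a variable shift, which is a lot of work for CHIP-8. That is the cost hiding behind the fast bulk routines.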

The complete examples are here: http://johnearnest.github.io/Octo/index.html?key=A6NUST1P

Host, Submitted (1 edit)

Thinking about it some more, I guess a simple merge isn't necessarily all that bad:

Takes up nearly a kilobyte and a half, though. Oof!

Or for a modest increase, you could composite an image under the control of a third masking buffer:


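In Python terms (a model of the logic only, not the Octo routines themselves), the two operations come down to a few bitwise expressions per byte. Reading "merge" as a bytewise OR and treating a set mask bit as "take the foreground" are both assumptions here, as are the function names.

```python
def merge(a, b):
    """Bytewise OR of two equally sized buffers (assumed meaning of 'merge')."""
    return bytes(x | y for x, y in zip(a, b))


def composite(background, foreground, mask):
    """Per bit: take the foreground where the mask is 1, else the background.

    ASSUMPTION: mask polarity (1 = foreground) is chosen for illustration.
    """
    return bytes(
        (bg & ~m & 0xFF) | (fg & m)
        for bg, fg, m in zip(background, foreground, mask)
    )
```

For example, `composite(b'\xff\xff', b'\x00\xaa', b'\x0f\xf0')` keeps the high nibble of the first background byte and the low nibble of the second, filling the rest from the foreground. On CHIP-8 each of those per-byte operations turns into register instructions, which is why the unrolled versions get so large.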
Submitted(+1)

Hey thanks man! Those look waaay more optimized than my routines. The main difference being that I was thinking my sprite drawing routine should do the heavy lifting of masking out the buffer and putting the sprites in place, instead of actually using multiple buffers. Seems a waste of cycles and memory otherwise, but maybe I'm not seeing something ;) I'll give it another spin, see if I can't get it sped up a bit with these ideas.

I also wrote a second implementation that doesn't use a buffer in memory, but that has some logic in the background drawing routine to mask out the foreground sprite. So the background just leaves a "hole" for the foreground to render into. Much less slow, but I'm a bit scared to go down that path because it requires a lot more bookkeeping in the background drawing routine, which may get hard to work with as the game gets more complicated...
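A tiny Python model of that "hole" idea, in case it helps anyone following along. It is a sketch under simplifying assumptions: everything is byte-aligned, the screen starts cleared, and the names `xor_draw` and `punch_hole` are made up for this example.

```python
def xor_draw(screen, data):
    """CHIP-8 style draw: XOR the sprite bytes into the screen bytes."""
    return bytes(s ^ d for s, d in zip(screen, data))


def punch_hole(background, fg_mask):
    """Clear the background bits that the foreground is going to cover."""
    return bytes(bg & ~m & 0xFF for bg, m in zip(background, fg_mask))


screen = bytes(2)                                  # cleared screen, two bytes wide
background = bytes([0b10101010, 0b11111111])
fg_mask = bytes([0b00111100, 0b00000000])          # area reserved for the sprite
foreground = bytes([0b00011000, 0b00000000])       # sprite pixels, inside the mask

# Draw the background with a hole in it, then XOR the foreground into the hole.
screen = xor_draw(screen, punch_hole(background, fg_mask))
screen = xor_draw(screen, foreground)
```

Because the hole is empty when the foreground arrives, the XOR draw behaves like a plain overwrite there. The bookkeeping cost is that the background routine has to know every foreground mask up front.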

Host, Submitted

I'll definitely be interested to hear about what you come up with!

Lots of possible tradeoffs depending on exactly what your needs are.

In general, when speed is of the utmost importance, macros and unrolled loops are your friends. One nice thing about this on XO-Chip is that while some of them take up a lot of memory, none of the above routines include any branches, so they could be located at the very end of the low 4k of address space and only consume one byte of precious "code RAM".