As it turns out, I decided to bite the bullet and go with real hardware. By "real hardware" I mean an emulator, but *theoretically* it would work on the real thing.
No pixel alignment problems here!
Current situation: if I'm going to have more than a 16x16-tile level, I'm going to need to get scrolling working. I'm already banging 64 bytes into VRAM each frame for the player sprite. I have just enough room in vblank (which is 10 scanlines) for another 64 if necessary, but I think I can safely fit 32 in if I precalc the tilemap data outside of vblank (that is, during the other 144 scanlines).
On the GB you cannot touch VRAM during active video. Technically I could bang bytes in during hblank, but that complicates things and seriously eats into the time I have outside of vblank. The vblank period is the longest continuous stretch of time in a frame (well, technically between two frames) where you can write to VRAM, and that's why the timing there is so precious. At 456 cycles per scanline, 10 scanlines gives you 4560 cycles of vblank, which realistically (assuming an average of 8 cycles per instruction) is about 570 instructions executed.
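
To make the budget concrete, here is a rough back-of-the-envelope sketch in Python. It only uses the figures above (10 vblank scanlines, 144 visible scanlines, 4560 vblank cycles, an average of 8 cycles per instruction). The function names are made up for illustration. It also ignores any overhead such as interrupt entry or other work done in vblank, so treat the results as upper bounds rather than measurements.

```python
# Back-of-the-envelope vblank budget, using only the figures from the post.
VBLANK_SCANLINES = 10        # scanlines of vblank per frame
VISIBLE_SCANLINES = 144      # scanlines outside of vblank
CYCLES_PER_SCANLINE = 456    # 4560 vblank cycles / 10 scanlines
AVG_CYCLES_PER_INSTRUCTION = 8  # the post's rough average, not a measured value


def vblank_cycles():
    """Total cycles available during vblank."""
    return VBLANK_SCANLINES * CYCLES_PER_SCANLINE


def visible_cycles():
    """Cycles outside of vblank, usable for precalculating tilemap data."""
    return VISIBLE_SCANLINES * CYCLES_PER_SCANLINE


def instruction_budget(cycles, avg=AVG_CYCLES_PER_INSTRUCTION):
    """Roughly how many instructions fit in the given number of cycles."""
    return cycles // avg


def max_cycles_per_byte(num_bytes):
    """Upper bound on what a copy loop may cost per byte if num_bytes must
    land in VRAM within a single vblank. Assumes vblank does nothing else."""
    return vblank_cycles() // num_bytes


if __name__ == "__main__":
    print("vblank cycles:      ", vblank_cycles())
    print("vblank instructions:", instruction_budget(vblank_cycles()))
    print("visible cycles:     ", visible_cycles())
    # 64 bytes for the player sprite plus 32 or 64 more for the tilemap
    for total in (64 + 32, 64 + 64):
        print(total, "bytes -> at most", max_cycles_per_byte(total),
              "cycles per byte")
```

Under these assumptions, 96 bytes per frame leaves a ceiling of 47 cycles per byte copied and 128 bytes leaves 35, which is why the smaller transfer is the comfortable one.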