Gates are fun when you are starting out in digital stuff, but if you want to step up your game, you're going to need to learn about LUTs (lookup tables).
The arrangement in the video is like this:
2 levels.
The 1st level handles 4 bits of input, shifting by 12, 8, 4 or 0 (another 2 bits) + direction (1 bit) + roll/shift (1 bit).
The next level does the same but shifts by 3, 2, 1 or 0, and the rest of the inputs are the same.
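A rough Python sketch of the two-level idea, assuming a 16-bit word. It only models the behaviour (coarse shift by 0/4/8/12, then fine shift by 0..3), not the actual ROM contents or wiring from the video. The names `stage` and `two_level_shift` are made up for illustration.

```python
WIDTH = 16                      # assumed word size for this sketch
MASK = (1 << WIDTH) - 1

def stage(value, amount, left, rotate):
    """One shifter level: move `value` by `amount` bits."""
    amount %= WIDTH
    if amount == 0:
        return value & MASK
    if left:
        moved = (value << amount) & MASK
        wrapped = value >> (WIDTH - amount)          # bits that fell off the top
    else:
        moved = value >> amount
        wrapped = (value << (WIDTH - amount)) & MASK  # bits that fell off the bottom
    # A roll ORs the wrapped bits back in, a plain shift drops them.
    return (moved | wrapped) if rotate else moved

def two_level_shift(value, amount, left=True, rotate=False):
    coarse = amount & 0b1100    # 0, 4, 8 or 12: upper 2 bits of the amount
    fine = amount & 0b0011      # 0, 1, 2 or 3: lower 2 bits of the amount
    after_first = stage(value, coarse, left, rotate)
    return stage(after_first, fine, left, rotate)
```

Both levels get the same direction and roll/shift flags, so the only thing that differs between them is which 2 bits of the shift amount they look at.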
Then between levels, I do an 8x OR to move data to the correct ROM. In your case, I'd guess that you need 8 ROMs in the first level (shift by 8, 16, 24) + another 8 to multiplex to the higher/lower strata of your total output, plus 16 to OR them all into the correct ROM chip, then another level for (2/4/6) with 8 ROMs plus 16 ROMs. Also note that 8-1 mergers and 1-8 splitters take roughly the same toll on performance as a NAND or a ROM chip.
In your case I guess it'd be better to shift/rotate through carry once, like in the ATmega/ARM architectures, and then repeat the process x times. It might still be worth looking into the ROM approach for 64 bit, although I'm not mad enough to waste my time doing it.
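A minimal Python sketch of the "one step, repeated x times" approach, assuming a 64-bit word. `rol_through_carry` models a single rotate-left-through-carry step; `repeat_left` is a hypothetical helper that clears the carry before each step for a plain shift, or feeds the top bit back in for a roll.

```python
def rol_through_carry(value, carry_in, width=64):
    """Shift left by one bit: carry_in enters at bit 0, the top bit leaves as carry."""
    mask = (1 << width) - 1
    carry_out = (value >> (width - 1)) & 1
    return ((value << 1) & mask) | (carry_in & 1), carry_out

def repeat_left(value, count, rotate=False, width=64):
    """Apply the single-step operation `count` times."""
    for _ in range(count):
        # Roll: preload the carry with the bit about to fall off the top.
        # Shift: clear the carry so a 0 comes in at the bottom.
        carry_in = (value >> (width - 1)) & 1 if rotate else 0
        value, _carry_out = rol_through_carry(value, carry_in, width)
    return value
```

The cost is one pass per bit of shift amount, so a shift by 63 takes 63 steps, but the hardware for a single step stays tiny.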
Also, there could be a more efficient way to do it: for example, if you want to shift by 15, you roll by -1 and block the 15 next inputs. I'll try it after I finish some other stuff.
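One way to read the roll-and-block trick, sketched in Python for an assumed 16-bit word: roll in whichever direction is shorter, then mask off the bits that wrapped around. This is an untested interpretation of the idea, and `roll` and `shift_left_via_roll` are illustrative names.

```python
WIDTH = 16                      # assumed word size for this sketch
MASK = (1 << WIDTH) - 1

def roll(value, amount):
    """Rotate left by `amount`; a negative amount rotates right."""
    amount %= WIDTH
    return ((value << amount) | (value >> (WIDTH - amount))) & MASK

def shift_left_via_roll(value, amount):
    # Pick the shorter direction: rolling 15 left equals rolling 1 right (-1).
    signed = amount if amount <= WIDTH // 2 else amount - WIDTH
    rolled = roll(value & MASK, signed)
    # Block the low `amount` bits, which only hold wrapped-around data.
    keep = (MASK << amount) & MASK
    return rolled & keep
```

For a shift by 15 this rolls by -1 and keeps only the top bit, so the other 15 positions are blocked.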
As a general rule of thumb, it's almost always better to use a LUT for anything. I mean... a single OR operation between two operands would take one 8-1 merger, one ROM and then a 1-8 splitter, or 3 NAND gates. But a XOR would take the same merger, splitter and ROM, or 4 NANDs. So even for this, a ROM is better performance-wise. The only situation where gates are better is when you want to invert up to 2 lines, or AND 2 ops.
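To make the comparison concrete, here is a small Python sketch. The NAND versions use the standard 3-gate OR and 4-gate XOR constructions. The ROM layout (two 4-bit operands packed into one 8-bit address) is an assumption for illustration, not necessarily how the chips are wired in the video.

```python
def nand(a, b):
    """Single-bit NAND."""
    return 1 - (a & b)

def or_from_nands(a, b):
    # 3 NANDs: invert both inputs, then NAND the results.
    return nand(nand(a, a), nand(b, b))

def xor_from_nands(a, b):
    # 4 NANDs: one shared NAND feeding two more, then a final NAND.
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))

def build_rom(func):
    """256-entry ROM; assumed address layout: operand a in the high nibble, b in the low."""
    return [func(addr >> 4, addr & 0xF) & 0xFF for addr in range(256)]

def rom_lookup(rom, a, b):
    """Merge the two operands into one address and read the ROM."""
    return rom[((a & 0xF) << 4) | (b & 0xF)]

OR_ROM = build_rom(lambda a, b: a | b)
XOR_ROM = build_rom(lambda a, b: a ^ b)
```

The lookup path is identical for OR and XOR, only the ROM contents differ, while the gate version grows with the complexity of the function.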
I'll make a video soon about how to use them and how to think in state machines in general.