The arrangement in the video is like this:
2 levels.
The 1st level handles 4 bits of input, shifting by 12, 8, 4 or 0 (another 2 bits) + direction (1 bit) + roll/shift select (1 bit).
The next level does the same but shifts by 3, 2, 1 or 0; the rest of the inputs are the same.
Then between levels, I do an 8x OR to move the data to the correct ROM. In your case, I'd guess that you need 8 ROMs in the first level (shift by 8, 16, 24) + another 8 to multiplex to the higher/lower strata of your total output, plus 16 to OR them all into the correct ROM chip, then another level for (2/4/6) with another 8 ROMs plus 16 ROMs. Also note that 8-1 and 1-8 split mergers take roughly the same toll on performance as a NAND or a ROM chip.
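To make the two-level arrangement from the video a bit more concrete, here is a rough Python sketch of the logic only, not the actual ROM wiring. It assumes a 16-bit word (the two levels together cover shifts of 0 to 15), and the names `stage` and `two_level_shift` are made up for illustration.

```python
WIDTH = 16                  # assumed word size: levels cover 12/8/4/0 + 3/2/1/0
MASK = (1 << WIDTH) - 1


def stage(value, amount, left, rotate):
    """One level: shift or roll a WIDTH-bit value by `amount` positions."""
    value &= MASK
    amount %= WIDTH
    if amount == 0:
        return value
    if left:
        shifted = (value << amount) & MASK
        wrapped = value >> (WIDTH - amount)           # bits that fell off the top
    else:
        shifted = value >> amount
        wrapped = (value << (WIDTH - amount)) & MASK  # bits that fell off the bottom
    return (shifted | wrapped) if rotate else shifted


def two_level_shift(value, amount, left=True, rotate=False):
    """Split a 4-bit shift amount into a coarse level and a fine level."""
    coarse = amount & 0b1100   # level 1: 0, 4, 8 or 12
    fine = amount & 0b0011     # level 2: 0, 1, 2 or 3
    after_level_1 = stage(value, coarse, left, rotate)
    return stage(after_level_1, fine, left, rotate)
```

For example, `two_level_shift(0x0001, 15)` goes through a shift by 12 and then a shift by 3, and returns `0x8000`.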
In your case I guess it'd be better to shift/rotate through carry by one position, like in the ATmega/ARM architectures, and then repeat the process x times. It might still be worth looking at the ROM approach for 64 bit, although I'm not mad enough to spend my time doing it.
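The "one step through carry, repeated x times" idea could look roughly like this in Python. This is only a sketch: the function names are made up, and the default width of 32 is an assumption you would change to match your design.

```python
def rol_through_carry(value, carry_in, width=32):
    """Rotate left by one through carry. Returns (new_value, carry_out)."""
    mask = (1 << width) - 1
    carry_out = (value >> (width - 1)) & 1          # top bit leaves into carry
    new_value = ((value << 1) | (carry_in & 1)) & mask
    return new_value, carry_out


def shift_left_by(value, amount, width=32):
    """Logical shift left: repeat the one-bit step, always feeding in 0."""
    carry = 0
    for _ in range(amount):
        value, carry = rol_through_carry(value, 0, width)
    return value, carry                             # carry holds the last bit shifted out


def rotate_left_by(value, amount, width=32):
    """Plain rotate left: feed the outgoing top bit straight back in."""
    for _ in range(amount):
        top = (value >> (width - 1)) & 1
        value, _ = rol_through_carry(value, top, width)
    return value
```

The hardware is tiny this way, but a shift by x costs x steps instead of one pass through the ROM levels.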
Also, there could be a more efficient way to do it: for example, if you want to shift by 15, you roll by -1 and block the 15 next inputs. I'll try it after I finish some other stuff.
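Here is how I read that roll-and-block trick, as a small Python sketch. I haven't built it, so treat it as an assumption about how it would work: a 16-bit word, a left shift by n done as a roll by n - 16 (so 15 becomes -1), and then the low n bits forced to 0. The function names are made up.

```python
WIDTH = 16                  # assumed word size
MASK = (1 << WIDTH) - 1


def rotate_left(value, amount):
    """Roll left by `amount`; a negative amount rolls right."""
    value &= MASK
    amount %= WIDTH         # in Python, -1 % 16 == 15
    return ((value << amount) | (value >> (WIDTH - amount))) & MASK


def shift_left_via_roll(value, amount):
    """Shift left by rolling, then blocking the bits that wrapped around."""
    rolled = rotate_left(value, amount - WIDTH)   # e.g. shift by 15 -> roll by -1
    keep = (MASK << amount) & MASK                # zeros in the low `amount` bits
    return rolled & keep
```

With this, `shift_left_via_roll(0xFFFF, 15)` rolls by -1 (still `0xFFFF`) and then blocks the low 15 bits, giving `0x8000`.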