Connect them as in the picture you posted, but feed a single clock to all of them instead of pulses. If it goes wrong, post the inside of the DFF.
Arnadath
Recent community posts
What I'm noticing:
1) You are using latches. You want master-slave D flip-flops, which output the previous state while taking in the current one, which brings me to...
2) Your clocking mechanism should be an actual clock instead of a pulse, and you want to clock the whole mechanism instead of individual FFs. Maybe connect a clock and the pulse to an AND gate and have it pulse once or thrice like you want. In any case, if you want more help, show the inside of the D-latch.
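To see why latches misbehave on a shared clock, here is a rough behavioral sketch in Python (not the game's actual components). The class names, the `tick` helper and the assumption that the master is transparent on the low phase and the slave on the high phase are all mine, just for illustration.

```python
class DLatch:
    """Level-sensitive latch: transparent while enable is high, holds otherwise."""
    def __init__(self):
        self.q = 0

    def update(self, d, enable):
        if enable:
            self.q = d
        return self.q


class MasterSlaveDFF:
    """Two latches in series on opposite clock phases (assumed polarity)."""
    def __init__(self):
        self.master = DLatch()
        self.slave = DLatch()

    @property
    def q(self):
        return self.slave.q

    def update(self, d, clk):
        # Master follows d while clk is low; slave copies master while clk is high.
        self.master.update(d, enable=not clk)
        return self.slave.update(self.master.q, enable=clk)


def tick(stages, d):
    """One full cycle (low phase, then high phase) of a clock shared by all stages."""
    for clk in (0, 1):
        signal = d
        for stage in stages:
            signal = stage.update(signal, clk)
    return [stage.q for stage in stages]


latches = [DLatch() for _ in range(3)]
dffs = [MasterSlaveDFF() for _ in range(3)]
latch_result = tick(latches, 1)   # the 1 races through every latch in one cycle
dff_results = [tick(dffs, 1), tick(dffs, 0), tick(dffs, 0)]  # the 1 moves one stage per cycle
```

With the latches, the input bit falls through the whole chain during a single high phase. With the master-slave version, each stage hands over its previous state, so the bit advances one stage per clock.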
You can make a 3-bit D flip-flop with a single ROM chip.
Connect everything as shown in the image below, then go to https://pastebin.com/MaKdTNkP, copy the binary, and paste it into the ROM.
Great for reducing component count.

As a general rule of thumb, it's almost always better to use a LUT for anything. I mean... a single OR operation between two operands would take one 8-1 merger, one ROM and then a 1-8 splitter, or 3 NAND gates. But an XOR would take the same merger, splitter and ROM, or 4 NANDs. So even for this, a ROM is better performance-wise. The only situations where gates are better are when you want to invert up to 2 lines, or AND 2 operands.
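A small Python sketch of the comparison above: the same 256-entry table layout works for OR and XOR, while the NAND-only versions need 3 and 4 gates. The address layout (operand a on bit 0, operand b on bit 1, other bits unused) and all function names are assumptions for illustration, not the game's actual pinout.

```python
def nand(a, b):
    return 1 - (a & b)


def or_from_nands(a, b):
    # 3 NAND gates: invert each input, then NAND the results.
    return nand(nand(a, a), nand(b, b))


def xor_from_nands(a, b):
    # 4 NAND gates: the classic XOR construction.
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def build_rom(fn):
    """256-entry ROM; assumed layout: address bit 0 = a, bit 1 = b."""
    return [fn(addr & 1, (addr >> 1) & 1) for addr in range(256)]


def lookup(rom, a, b):
    address = (a & 1) | ((b & 1) << 1)  # what the 8-1 merger would assemble
    return rom[address]                 # the 1-8 splitter would fan this out


OR_ROM = build_rom(lambda a, b: a | b)
XOR_ROM = build_rom(lambda a, b: a ^ b)
```

Swapping the function only changes the ROM contents, not the wiring, which is why the component count stays at merger + ROM + splitter no matter how complex the logic gets.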
I'll make a video soon about how to use them and how to think in state machines in general.
The arrangement in the video is like this:
2 levels.
The 1st level handles 4 bits of input, shifting by 12, 8, 4 or 0 (another 2 bits) + direction (1 bit) + roll/shifting (1 bit).
The next level does the same but shifts by 3, 2, 1 or 0, and the rest of the inputs are the same.
Then between levels, I do an 8xOR to move data to the correct ROM. In your case, I'd guess that you need 8 ROMs in the first level (shift by 8, 16, 24) + another 8 to multiplex to the higher/lower strata of your total output, plus 16 to OR them all into the correct ROM chip, then another level for (2/4/6) plus 8 ROMs plus 16 ROMs. Also note that 8-1 and 1-8 split mergers take roughly the same toll on performance as a NAND or a ROM chip.
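The two-level idea (coarse shift by 12/8/4/0, then fine shift by 3/2/1/0, with direction and roll/shift bits shared by both levels) can be sketched functionally in Python. This models only the behavior for a 16-bit word, not the per-ROM slicing or the OR stage between levels, and the names `stage` and `two_level_shift` are made up for the example.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def stage(value, amount, left, rotate):
    """One level: shift or roll a 16-bit value by `amount` in one direction."""
    if amount == 0:
        return value
    if left:
        shifted = (value << amount) & MASK
        wrapped = value >> (WIDTH - amount)
    else:
        shifted = value >> amount
        wrapped = (value << (WIDTH - amount)) & MASK
    return (shifted | wrapped) if rotate else shifted


def two_level_shift(value, n, left=True, rotate=False):
    coarse = (n >> 2) & 0b11  # 2 select bits: shift by 0, 4, 8 or 12
    fine = n & 0b11           # 2 select bits: shift by 0, 1, 2 or 3
    value = stage(value & MASK, coarse * 4, left, rotate)
    return stage(value, fine, left, rotate)
```

Any amount from 0 to 15 splits into one coarse and one fine step, so two levels with 4 choices each cover all 16 shift distances.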
In your case I guess it'd be better to shift/rotate through carry once, like in the ATmega/ARM architecture, and then repeat the process x times, but it might be worth looking at the possibility of the ROM approach in 64 bit, although I'm not mad enough to waste my time and do it.
Also, there could be a more efficient way to do it: if you want to shift by 15, you roll by -1 and block the 15 next inputs. I'll try it after I finish some other stuff.
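A quick Python check of that roll-and-block idea, assuming a 16-bit word and a left shift: rolling by n - 16 (so -1 for n = 15) lands the bits in the same place as rolling by n, and masking off the low n bits removes what wrapped around. The function names are hypothetical, and this is only a sanity check of the arithmetic, not a tested circuit.

```python
def rotl16(x, n):
    """Roll a 16-bit value left by n; negative n rolls right."""
    x &= 0xFFFF
    n %= 16  # Python's % maps -1 to 15, so a roll by -1 equals a roll by 15
    return ((x << n) | (x >> (16 - n))) & 0xFFFF


def shl_via_roll(x, n):
    """Left shift by n (0..15) done as a short roll plus a mask."""
    rolled = rotl16(x, n - 16)       # e.g. n = 15 becomes a roll by -1
    keep = (0xFFFF << n) & 0xFFFF    # block the low n bits that wrapped around
    return rolled & keep
```

For n = 15 only the top bit survives the mask, which matches a plain shift by 15.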