What is the most performance-efficient way to build a barrel shifter in DLS? I made a 32-bit one using ~1000 NAND gates, with no other logic components and without a LUT.
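For context, one common NAND-only structure is a logarithmic shifter: five stages of 2:1 multiplexers, where stage k rotates by 2^k when its select bit is set. The Python sketch below models that structure at gate level and counts the NAND gates it uses. It is only an illustration under assumptions: it implements a left rotate (not necessarily the operation or layout of the design described above), and the names `nand`, `mux`, `rotate_left`, and `rotl32` are hypothetical helpers, not anything from DLS.

```python
# Hypothetical gate-level model of a logarithmic barrel shifter (left rotate),
# built only from 2-input NAND gates. Not tied to any DLS API.

gate_count = 0  # number of NAND gates evaluated in one pass


def nand(a, b):
    """2-input NAND on single bits (0 or 1); also counts the gate."""
    global gate_count
    gate_count += 1
    return 1 - (a & b)


def mux(sel, not_sel, a, b):
    """2:1 mux from 3 NANDs: returns a when sel == 0, b when sel == 1.
    The inverted select is passed in so it can be shared across a stage."""
    return nand(nand(a, not_sel), nand(b, sel))


def rotate_left(bits, amount_bits):
    """bits: list of 0/1, index 0 is the LSB.
    amount_bits: shift-amount bits, LSB first; stage k rotates by 2**k."""
    n = len(bits)
    for k, sel in enumerate(amount_bits):
        not_sel = nand(sel, sel)  # one inverter per stage
        step = 1 << k
        # Output bit i takes input bit i (no shift) or bit i - step (shift).
        bits = [mux(sel, not_sel, bits[i], bits[(i - step) % n])
                for i in range(n)]
    return bits


def rotl32(value, amount):
    """Rotate a 32-bit value left by amount (0..31) using the gate model."""
    global gate_count
    gate_count = 0
    bits = [(value >> i) & 1 for i in range(32)]
    amount_bits = [(amount >> k) & 1 for k in range(5)]
    out = rotate_left(bits, amount_bits)
    return sum(bit << i for i, bit in enumerate(out))


print(hex(rotl32(0x80000001, 1)), gate_count)
```

In this model each stage costs 32 × 3 NANDs for the multiplexers plus 1 NAND for the shared select inverter, so five stages come to 485 NANDs for a single rotate direction. Gate count is only one factor, though, and it does not by itself say how fast a given layout will simulate.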