I have not tried this yet, but in theory, if you are using pixel fonts and the camera that shows the render texture is orthographic, you can just "align" the pixels of the font to the pixels of the render texture. The same approach should work with any other artwork. With this setup the UI sits in front of the render texture (assuming the Canvas render mode is set to "Screen Space - Camera").
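To make the "align" idea a bit more concrete, here is a rough, engine-agnostic sketch of the arithmetic in Python (in Unity you would write the equivalent in a C# script). The names `snap_to_pixel_grid`, `texel_scale` and `pixels_per_unit` are made up for illustration, and it assumes the render texture is scaled up to the screen by a whole-number factor.

```python
import math


def snap_to_pixel_grid(value, pixels_per_unit):
    """Snap a world-space coordinate to the nearest render-texture pixel.

    pixels_per_unit is an assumed setting: how many render-texture
    pixels cover one world unit under the orthographic camera.
    """
    # Round half up explicitly; Python's round() rounds halves to even.
    return math.floor(value * pixels_per_unit + 0.5) / pixels_per_unit


def texel_scale(screen_height, texture_height):
    """Largest whole-number upscale factor that still fits the screen."""
    return max(1, screen_height // texture_height)


# Example: a UI element at x = 1.26 units with 16 pixels per unit
# lands on pixel 20, i.e. x = 1.25 units.
snapped_x = snap_to_pixel_grid(1.26, 16)
scale = texel_scale(1080, 270)
```

If the UI element positions are snapped like this and the font is drawn at a size that is a whole multiple of the upscale factor, the font pixels should line up with the render texture pixels.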
EDIT: Another idea, if you are rendering your UI into the render texture as well, is to add trigger colliders over the render texture. That can be tedious to set up, though, especially if you have animated UI. A possible workaround is to define static "hotspots" on the render texture and just swap in whatever UI fits into those hotspots.
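A minimal sketch of the hotspot idea, again in plain Python rather than Unity code: map the click position from screen pixels to render-texture pixels, then test it against a fixed list of rectangles. `Hotspot`, `screen_to_texture` and `find_hotspot` are hypothetical names, and the sketch assumes the render texture is stretched over the whole screen with no letterboxing and with the same origin corner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hotspot:
    """A static clickable rectangle, in render-texture pixels."""
    name: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px, py):
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def screen_to_texture(sx, sy, screen_size, texture_size):
    """Convert a screen-space point to render-texture pixel coordinates."""
    screen_w, screen_h = screen_size
    tex_w, tex_h = texture_size
    return int(sx * tex_w / screen_w), int(sy * tex_h / screen_h)


def find_hotspot(hotspots, sx, sy, screen_size, texture_size):
    """Return the first hotspot under the screen point, or None."""
    px, py = screen_to_texture(sx, sy, screen_size, texture_size)
    for hotspot in hotspots:
        if hotspot.contains(px, py):
            return hotspot
    return None


# Example layout: two fixed slots on a 320x180 render texture.
HOTSPOTS = [
    Hotspot("play", 10, 10, 40, 12),
    Hotspot("quit", 10, 30, 40, 12),
]
hit = find_hotspot(HOTSPOTS, 100, 60, (1280, 720), (320, 180))
```

Because the rectangles never move, whatever UI is currently drawn into a slot can change freely. Only the action bound to that slot's name needs to be swapped.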