Ok, in case anyone is interested: I tried the above idea (not the ASM part) and two things became clear: first of all, doing only one render inevitably causes a recursive feedback, because the points sampled are brightened and sampled again repeatedly, forcing me to use a very low alpha, but even then it stabilizes only due to rounding errors, causing it to flicker wildly. So I concluded there is no way around a 2nd render.
However, I found a much faster way: Render the scene without the rays mesh, full display size. Then do the point sample from the backbuffer and move it to the ray mesh texture. Then move the camera 10000 units away, where the entire scene is out of rendering range (the ray mesh is parented to the camera), set the cameraClsMode to maintain the backbuffer and now render the ray mesh alone ontop of it. Then set CameraClsMode to 1,1 again and move camera back to the scene. I was able to lower the rendering time of the effect from 23 to 16 ms - still very slow.
That's when I figured out the second thing: from the 16 ms about 13 ms were used only by the commands lockbuffer backbuffer() and unlockbuffer backbuffer() ! I tried it with no fastpixelreading, it took 13ms, then also without lockbuffer and it went down to like 1ms.
So the main bottleneck seems to be lockbuffer. It seems to wait for some green light from directX, which is in sync with the system framerate. I tried VWait right before lockbuffer and was able to lower it from 16 to 5 ms. But VWait should always be followed by flip 0, if used at all. Maybe I'll upload the source.