Oh, and BTW (sorry for the text avalanche): now that I think about it, I might as well try an entirely different approach that uses only one render. Point-sample a mini copy of the main render from the backbuffer, e.g. 256x256 pixels, then scale it down to 128x128 while smoothing the point samples (blur-shrink), probably using inline assembler (as you can in GFA BASIC, and probably in FreeBASIC, which can build a DLL that Blitz3D can then use via userlib decls). When reading the point samples from the render, ignore anything that isn't bright enough (or set it to RGB 0).
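To make the blur-shrink step concrete, here is a minimal sketch in Python (not Blitz3D, and not the inline-assembler version mentioned above). It assumes the point samples are already in memory as rows of (r, g, b) tuples; the function names, the plain-average brightness measure and the threshold value are made up for illustration.

```python
def brightness(pixel):
    """Plain average of the three channels (an assumed brightness measure)."""
    r, g, b = pixel
    return (r + g + b) // 3


def blur_shrink(src, threshold=128):
    """Halve an image by averaging each 2x2 block of point samples.

    src is a list of rows, each row a list of (r, g, b) tuples.
    Samples darker than threshold are treated as black (rgb 0),
    so only the bright parts of the render feed the rays texture.
    """
    height, width = len(src), len(src[0])
    out = []
    for y in range(0, height - 1, 2):
        row = []
        for x in range(0, width - 1, 2):
            acc = [0, 0, 0]
            for dy in (0, 1):
                for dx in (0, 1):
                    pixel = src[y + dy][x + dx]
                    if brightness(pixel) >= threshold:
                        for c in range(3):
                            acc[c] += pixel[c]
            # Always divide by 4: rejected samples contribute black.
            row.append(tuple(c // 4 for c in acc))
        out.append(row)
    return out
```

Running this on a 256x256 sample grid gives the 128x128 result described above. Dividing by 4 even when some samples were rejected is a design choice: it lets dark neighbours dim the block instead of only shrinking the bright area.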
OK, in case anyone is interested: I tried the above idea (not the ASM part) and two things became clear. First, doing only one render inevitably causes recursive feedback, because the sampled points are brightened and then sampled again, over and over. That forced me to use a very low alpha, and even then it only stabilizes because of rounding errors, which makes it flicker wildly. So I concluded there is no way around a second render.
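The feedback problem can be shown with a toy one-pixel model in Python. This is only a sketch under assumptions: `gain` is a hypothetical stand-in for the combined alpha of all ray layers, blending is treated as a simple clamped add, and the real effect (with blur and many pixels) behaves in a more complicated way, including the flicker, which this model does not reproduce.

```python
def single_render(scene, gain, frames):
    """Rays texture is sampled from the frame that already contains rays."""
    shown = scene
    history = []
    for _ in range(frames):
        # The sample taken from the displayed frame includes last frame's
        # rays, so the brightening is fed back into itself every frame.
        shown = min(255, scene + int(gain * shown))
        history.append(shown)
    return history


def two_renders(scene, gain, frames):
    """Rays texture is sampled from a clean render without the rays."""
    return [min(255, scene + int(gain * scene)) for _ in range(frames)]
```

With `scene = 100` and `gain = 0.5`, the two-render version stays at 150 every frame, while the single-render version keeps climbing until integer truncation stops it at 199. With `gain = 1.0` the single-render version saturates at 255.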
However, I found a much faster way: render the scene without the ray mesh at full display size. Then take the point samples from the backbuffer and write them to the ray mesh texture. Then move the camera 10000 units away, so the entire scene is out of rendering range (the ray mesh is parented to the camera), set CameraClsMode so the backbuffer is kept, and render the ray mesh alone on top of it. Finally, set CameraClsMode back to 1,1 and move the camera back to the scene. This lowered the rendering time of the effect from 23 to 16 ms, which is still very slow.
That's when I figured out the second thing: of those 16 ms, about 13 ms were spent in the commands LockBuffer BackBuffer() and UnlockBuffer BackBuffer() alone! Without the fast pixel reads it still took 13 ms; once I also removed LockBuffer it went down to about 1 ms.
So the main bottleneck seems to be LockBuffer. It seems to wait for some green light from DirectX, which is in sync with the system frame rate. I tried VWait right before LockBuffer and was able to lower it from 16 to 5 ms. But VWait should always be followed by Flip 0, if used at all. Maybe I'll upload the source.
Very interesting and cool insights, as always! I'm curious about the FreeBASIC or inline assembly route to make it faster, as I presume this is how FastExt does this effect.
There is also an idea from Fredborg, which RemiD described before (quoted below), whose outcome I'm very interested in and that you might look into. I guess you could use some form of light-trails effect for the rays, so perhaps you can have a go at it!
"the idea was to have a subdivided quad parented to the camera, have its vertices colored with the sun color, and use linepicks from the sun to each vertex, and set the vertices alphas accordingly (if a light ray can reach a vertex, alpha 0.5, if a light ray can't reach a vertex, alpha 0)
with blendmode add or multiply2..."
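The quoted idea boils down to a visibility test per vertex. Below is a minimal Python sketch of that logic, under loud assumptions: occluders are modelled as spheres, and a hand-written segment-versus-sphere test stands in for the pick (in Blitz3D a LinePick from the sun towards each vertex would presumably do this job). All function names here are hypothetical.

```python
def segment_hits_sphere(a, b, center, radius):
    """True if the segment from a to b passes within radius of center."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [center[i] - a[i] for i in range(3)]
    ab_len2 = sum(c * c for c in ab)
    if ab_len2 == 0:
        t = 0.0
    else:
        # Project the sphere center onto the segment, clamped to its ends.
        t = sum(ab[i] * ac[i] for i in range(3)) / ab_len2
        t = max(0.0, min(1.0, t))
    closest = [a[i] + t * ab[i] for i in range(3)]
    dist2 = sum((closest[i] - center[i]) ** 2 for i in range(3))
    return dist2 <= radius * radius


def vertex_alphas(sun, vertices, occluders, lit_alpha=0.5):
    """Alpha 0.5 where the sun reaches the vertex, alpha 0 where it is blocked.

    occluders is a list of (center, radius) pairs.
    """
    alphas = []
    for vertex in vertices:
        blocked = any(
            segment_hits_sphere(sun, vertex, center, radius)
            for center, radius in occluders
        )
        alphas.append(0.0 if blocked else lit_alpha)
    return alphas
```

For example, with the sun at `(0, 0, 100)` and one occluder of radius 2 at `(0, 0, 50)`, a vertex at `(0, 0, 0)` gets alpha 0 and a vertex at `(10, 0, 0)` gets alpha 0.5. The cost grows with the number of vertices in the subdivided quad, since each one needs its own pick.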
Maybe I'm wrong, but as far as I remember Blitz3D doesn't support render-to-texture. The addition of flag 256 ("Store texture in vram") allowed for faster access, but no direct render to texture. CopyRect from the backbuffer to a VRAM texture is rather fast, though. LockBuffer is slow and is the actual problem. A way around it would be:
Render the actual scene, without rays.
Render a mini version at 256x256 and CopyRect it to the texture buffer (optimized by hiding details like grass).
Render the rays mesh alone on top.
This 3-render method does not allow for pixel access, so I'd use EntityColor 0,0,0 for shadow-casting things, like in the source. Most important: no LockBuffer required. That should lower the 16 ms to 2 ms. The beauty of extracting the texture from the first render using ReadPixel etc. was that its speed is independent of scene complexity, but well, it required LockBuffer, so...
Edit: I just tried that, and it was not significantly faster, unless I did something wrong. There was also a hack to directly peek/poke VRAM and buffers, I guess using memory.lib. Most likely not very stable.