This is the possible optimization that I mention at the end of the blog - using a private map for each thread block.
The catch is that this map must fit in shared memory, which is pretty limited on all current hardware: ~100KB.
I originally thought that my map (stats array) was too big to fit into this shared memory. Now, however, I realize it can. It'll interesting to see how much speedup (or not!) this optimization can bring.
The catch is that this map must fit in shared memory, which is pretty limited on all current hardware: ~100KB.
I originally thought that my map (stats array) was too big to fit into this shared memory. Now, however, I realize it can. It'll interesting to see how much speedup (or not!) this optimization can bring.