This is the possible optimization that I mention at the end of the blog - using ...

This is the possible optimization that I mention at the end of the blog - using a private map for each thread block.

The catch is that this map must fit in shared memory, which is pretty limited on all current hardware: ~100KB.

I originally thought that my map (stats array) was too big to fit into this shared memory. Now, however, I realize it can. It'll interesting to see how much speedup (or not!) this optimization can bring.