The Programming Massively Parallel Processors textbook has a chapter on histograms. You have basically mentioned the main trick though (privatization).
For this problem I think coarsening would also help (so everything isn’t queued up on the copy to global memory).
The PMPP book is great. I reread the histogram chapter after finishing the blog, and realized I could use privatization. You got me!
By coarsening, do you mean making the threads handle more file parts, and reducing the number of private copies (of histogram or stats here) to globally commit at the end?
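For reference, here is a minimal sketch of how privatization and thread coarsening might combine in a plain byte histogram. It is not the code from the blog or from the PMPP book. The names `histogram_private_coarsened`, `histogram_gpu`, `NUM_BINS`, and `COARSEN` are hypothetical, and the 256-bin layout and the coarsening factor of 8 are assumptions chosen only for illustration.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int NUM_BINS = 256;  // assumption: one bin per byte value
constexpr int COARSEN  = 8;    // assumption: elements per thread, tune per GPU

__global__ void histogram_private_coarsened(const unsigned char* data, int n,
                                            unsigned int* global_hist) {
    // Privatization: each block accumulates into its own shared-memory copy.
    __shared__ unsigned int local_hist[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        local_hist[b] = 0;
    }
    __syncthreads();

    // Coarsening: each thread handles COARSEN elements, strided by blockDim.x
    // so that neighbouring threads read neighbouring bytes.
    int base = blockIdx.x * blockDim.x * COARSEN + threadIdx.x;
    for (int k = 0; k < COARSEN; ++k) {
        int i = base + k * blockDim.x;
        if (i < n) {
            atomicAdd(&local_hist[data[i]], 1u);
        }
    }
    __syncthreads();

    // Commit: at most one global atomic per bin per block. With coarsening
    // there are COARSEN times fewer blocks, so fewer commits contend here.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        unsigned int count = local_hist[b];
        if (count > 0) {
            atomicAdd(&global_hist[b], count);
        }
    }
}

// Hypothetical host wrapper: copies the input over, launches the kernel,
// and returns the NUM_BINS counts.
std::vector<unsigned int> histogram_gpu(const std::vector<unsigned char>& input) {
    std::vector<unsigned int> result(NUM_BINS, 0);
    int n = static_cast<int>(input.size());
    if (n == 0) return result;

    unsigned char* d_data = nullptr;
    unsigned int* d_hist = nullptr;
    cudaError_t err = cudaMalloc(&d_data, n);
    assert(err == cudaSuccess);
    err = cudaMalloc(&d_hist, NUM_BINS * sizeof(unsigned int));
    assert(err == cudaSuccess);

    cudaMemcpy(d_data, input.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, NUM_BINS * sizeof(unsigned int));

    int threads = 256;
    int per_block = threads * COARSEN;               // elements covered by one block
    int blocks = (n + per_block - 1) / per_block;    // round up
    histogram_private_coarsened<<<blocks, threads>>>(d_data, n, d_hist);

    err = cudaMemcpy(result.data(), d_hist, NUM_BINS * sizeof(unsigned int),
                     cudaMemcpyDeviceToHost);        // also waits for the kernel
    assert(err == cudaSuccess);

    cudaFree(d_data);
    cudaFree(d_hist);
    return result;
}
```

In this sketch, raising `COARSEN` means fewer blocks and therefore fewer private copies to merge at the end, which is the effect the question above describes. Whether it is a net win on a given workload is something to measure rather than assume.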