FlashAttention-2, 2x faster than FlashAttention (twitter.com/tri_dao)
82 points by machdiamonds on July 17, 2023 | 18 comments


Does anyone have resources for a good way to get started with this sort of modern GPU systems work?


If you’d like a practical goal, you probably want to learn PyTorch and pick up some background knowledge of GPU memory architecture. If you want to go deep, learn CUDA: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....
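For context, the operation FlashAttention speeds up is ordinary scaled dot-product attention. A minimal PyTorch sketch (illustrative only; it materializes the full score matrix, which is exactly what FlashAttention's tiled kernel avoids):

    # Naive scaled dot-product attention in plain PyTorch. This builds
    # the full (seqlen x seqlen) score matrix in memory -- the bottleneck
    # FlashAttention is designed to avoid.
    import math
    import torch

    def naive_attention(q, k, v):
        # q, k, v: (batch, nheads, seqlen, headdim)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        return torch.softmax(scores, dim=-1) @ v

    q = k = v = torch.randn(1, 8, 1024, 64)
    out = naive_attention(q, k, v)  # (1, 8, 1024, 64)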


Yes, I know PyTorch well at this point and have a basic understanding of the memory architecture. I'm in the process of learning CUDA, but would love pointers to intermediate and deeper things to explore.


I found this talk helpful. https://on-demand.gputechconf.com/gtc/2017/presentation/s712...

Have you tried the Visual Profiler yet?


I found it helpful to start with CUDA via numba, since it lets you write GPU kernels in Python. Assuming you're like most ML engineers and more familiar with Python than C++, this lets you learn CUDA concepts without also having to learn C++ at the same time. There's also a set of GPU puzzles for beginners [1] you can use to get started with numba CUDA.

[1] https://github.com/srush/GPU-Puzzles
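To make the numba route concrete, here's a minimal sketch of a numba CUDA kernel (an elementwise add; assumes the numba package and a CUDA-capable GPU):

    # Minimal numba CUDA kernel: elementwise vector add.
    import numpy as np
    from numba import cuda

    @cuda.jit
    def vector_add(a, b, out):
        i = cuda.grid(1)       # absolute thread index across the grid
        if i < out.size:       # guard threads past the end of the array
            out[i] = a[i] + b[i]

    n = 1 << 20
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    out = np.zeros_like(a)

    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block
    vector_add[blocks, threads_per_block](a, b, out)

    assert np.allclose(out, a + b)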


Thanks for the link! Sasha is actually my former professor - if this is anything like his past pytorch puzzles I'm sure I'll find it enjoyable.


I'd start with the example of implementing the fastest reduction you possibly can. Pretty much all the complexity in the kernels used in ML derives from this concept (reductions with addition).

https://developer.download.nvidia.com/assets/cuda/files/redu...
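A sketch of the shared-memory tree reduction those slides build up to, written in numba CUDA to stay in Python (the block size and the two-pass strategy here are illustrative choices, not the slides' final optimized version):

    # Tree reduction in shared memory: each block sums its tile down to
    # one partial result, halving the number of active threads each step.
    # Partials are combined on the host for simplicity.
    import numpy as np
    from numba import cuda, float32

    TPB = 256  # threads per block (compile-time constant for shared array)

    @cuda.jit
    def block_sum(x, partial):
        tile = cuda.shared.array(TPB, float32)
        tid = cuda.threadIdx.x
        i = cuda.grid(1)

        # load one element per thread (0 if past the end of the input)
        tile[tid] = x[i] if i < x.size else 0.0
        cuda.syncthreads()

        # sequential-addressing tree reduction
        s = TPB // 2
        while s > 0:
            if tid < s:
                tile[tid] += tile[tid + s]
            cuda.syncthreads()
            s //= 2

        if tid == 0:
            partial[cuda.blockIdx.x] = tile[0]

    n = 1 << 20
    x = np.random.rand(n).astype(np.float32)
    blocks = (n + TPB - 1) // TPB
    partial = np.zeros(blocks, dtype=np.float32)
    block_sum[blocks, TPB](x, partial)
    assert np.isclose(partial.sum(), x.sum(), rtol=1e-3)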


I love this PDF; it has some crazy staying power, given that he's referring to a G80 GPU there!


thank you for the suggestion - will take a look!


I would be very grateful to see how one can leverage this not for LLMs but for Stable Diffusion models.


Why couldn't it be applied to SD?


SD1 uses a head dimension that is not supported by the old version of Flash Attention but is supported by this version.



That's Flash Attention 1, but yeah there's nothing stopping someone from submitting a PR.


This enables V1. V2 has yet to be integrated into xformers; the team replied saying it should happen this week.

See the relevant GitHub issue here: https://github.com/facebookresearch/xformers/issues/795
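Until the xformers integration lands, you can call the kernel directly through the flash-attn package. A hedged sketch (flash_attn_func is the flash-attn 2.x entry point; the head dimension below is illustrative, not a claim about SD1's actual dimensions):

    # Calling FlashAttention-2 directly via the flash-attn package
    # (pip install flash-attn). flash_attn_func takes (batch, seqlen,
    # nheads, headdim) tensors in fp16/bf16 on a CUDA device.
    import torch
    from flash_attn import flash_attn_func

    batch, seqlen, nheads, headdim = 2, 4096, 8, 160  # headdim is illustrative
    q = torch.randn(batch, seqlen, nheads, headdim,
                    dtype=torch.float16, device="cuda")
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    out = flash_attn_func(q, k, v, causal=False)  # same shape as q
    print(out.shape)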


huge! thank you Tri!


Tri Dao and Tim Dettmers ftw


Excellent work by the team on this. Excited to see this applied to the diffusers library.



