FlashAttention-2, 2x faster than FlashAttention (twitter.com/tri_dao)
82 points by machdiamonds on July 17, 2023 | 18 comments


Does anyone have resources for a good way to get started with this sort of modern GPU systems work?


If you’d like a practical goal, you probably want to learn PyTorch and pick up some background knowledge of GPU memory architecture. If you want to go deep, learn CUDA: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....
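For context, the operation FlashAttention speeds up is ordinary scaled dot-product attention. A minimal PyTorch sketch (illustrative only; it materializes the full score matrix, which is exactly what FlashAttention's tiled kernel avoids):

    # Naive scaled dot-product attention in plain PyTorch. This builds
    # the full (seqlen x seqlen) score matrix in memory -- the bottleneck
    # FlashAttention is designed to avoid.
    import math
    import torch

    def naive_attention(q, k, v):
        # q, k, v: (batch, nheads, seqlen, headdim)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        return torch.softmax(scores, dim=-1) @ v

    q = k = v = torch.randn(1, 8, 1024, 64)
    out = naive_attention(q, k, v)  # (1, 8, 1024, 64)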


Yes, I know PyTorch well at this point and have a basic understanding of the memory architecture. I'm in the process of learning CUDA, but would love pointers to intermediate and deeper things to explore.


I found this talk helpful. https://on-demand.gputechconf.com/gtc/2017/presentation/s712...

Have you tried the Visual Profiler yet?


I found it helpful to start with CUDA via numba, since it lets you write GPU kernels in Python. Assuming you're like most ML engineers and more familiar with Python than C++, this lets you learn CUDA concepts without also having to learn C++ at the same time. There's also a set of GPU puzzles for beginners [1] you can use to get started with numba CUDA.

[1] https://github.com/srush/GPU-Puzzles
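To make the numba route concrete, here's a minimal sketch of a numba CUDA kernel (an elementwise add; assumes the numba package and a CUDA-capable GPU):

    # Minimal numba CUDA kernel: elementwise vector add.
    import numpy as np
    from numba import cuda

    @cuda.jit
    def vector_add(a, b, out):
        i = cuda.grid(1)       # absolute thread index across the grid
        if i < out.size:       # guard threads past the end of the array
            out[i] = a[i] + b[i]

    n = 1 << 20
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    out = np.zeros_like(a)

    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block
    vector_add[blocks, threads_per_block](a, b, out)

    assert np.allclose(out, a + b)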


Thanks for the link! Sasha is actually my former professor - if this is anything like his past pytorch puzzles I'm sure I'll find it enjoyable.


I'd start with the example of implementing the fastest reduction you possibly can. Pretty much all the complexity in the kernels used in ML derives from this concept (reductions with addition).

https://developer.download.nvidia.com/assets/cuda/files/redu...
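A sketch of the shared-memory tree reduction those slides build up to, written in numba CUDA to stay in Python (the block size and the two-pass strategy here are illustrative choices, not the slides' final optimized version):

    # Tree reduction in shared memory: each block sums its tile down to
    # one partial result, halving the number of active threads each step.
    # Partials are combined on the host for simplicity.
    import numpy as np
    from numba import cuda, float32

    TPB = 256  # threads per block (compile-time constant for shared array)

    @cuda.jit
    def block_sum(x, partial):
        tile = cuda.shared.array(TPB, float32)
        tid = cuda.threadIdx.x
        i = cuda.grid(1)

        # load one element per thread (0 if past the end of the input)
        tile[tid] = x[i] if i < x.size else 0.0
        cuda.syncthreads()

        # sequential-addressing tree reduction
        s = TPB // 2
        while s > 0:
            if tid < s:
                tile[tid] += tile[tid + s]
            cuda.syncthreads()
            s //= 2

        if tid == 0:
            partial[cuda.blockIdx.x] = tile[0]

    n = 1 << 20
    x = np.random.rand(n).astype(np.float32)
    blocks = (n + TPB - 1) // TPB
    partial = np.zeros(blocks, dtype=np.float32)
    block_sum[blocks, TPB](x, partial)
    assert np.isclose(partial.sum(), x.sum(), rtol=1e-3)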


I love this PDF; it has some crazy staying power, given that he's referring to a G80 GPU there!


thank you for the suggestion - will take a look!


I would be very grateful to see how one can leverage this not for LLMs but for Stable Diffusion models.


Why couldn't it be applied to SD?


SD1 uses a head dimension that is not supported by the old version of Flash Attention but is supported by this version.



That's Flash Attention 1, but yeah there's nothing stopping someone from submitting a PR.


This enables V1. V2 has yet to be integrated into xformers; the team replied saying it should happen this week.

See the relevant GitHub issue here: https://github.com/facebookresearch/xformers/issues/795
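Until the xformers integration lands, you can call the kernel directly through the flash-attn package. A hedged sketch (flash_attn_func is the flash-attn 2.x entry point; the head dimension below is illustrative, not a claim about SD1's actual dimensions):

    # Calling FlashAttention-2 directly via the flash-attn package
    # (pip install flash-attn). flash_attn_func takes (batch, seqlen,
    # nheads, headdim) tensors in fp16/bf16 on a CUDA device.
    import torch
    from flash_attn import flash_attn_func

    batch, seqlen, nheads, headdim = 2, 4096, 8, 160  # headdim is illustrative
    q = torch.randn(batch, seqlen, nheads, headdim,
                    dtype=torch.float16, device="cuda")
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    out = flash_attn_func(q, k, v, causal=False)  # same shape as q
    print(out.shape)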


huge! thank you Tri!


Tri Dao and Tim Dettmers ftw


Excellent work by the team on this. Excited to see this applied to the diffusers library.



