In a nutshell, diffusion models break up the difficult task of generating natural signals (such as images or sound) into many smaller partial denoising tasks. This is done by defining a corruption process that gradually adds noise to an input until all of the signal is drowned out (this is the "diffusion"), and then learning how to invert that process step-by-step.
This is not dissimilar to how modern language models work: they break up the task of generating text into a series of easier next-word-prediction tasks. In both cases, the model only solves a small part of the problem at a time, and you apply it repeatedly to generate a signal.