> Transformers are the dominant architecture in AI, yet why they work remains poorly understood. This paper offers a precise answer: a transformer is a Bayesian network.
Why would being a Bayesian network explain why transformers work? Bayesian networks existed long before transformers and never achieved their performance.
A Bayesian network is a very general concept: it applies to any multidimensional probability distribution. It's a directed graph that encodes conditional independences between variables. Ish.
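To make that concrete, here's a toy sketch (my own, not from the paper): a chain-structured network A → B → C over binary variables. The graph asserts that C is independent of A given B, so the joint factorizes as P(a, b, c) = P(a)·P(b|a)·P(c|b).

```python
import itertools

# Toy Bayesian network over three binary variables with chain structure A -> B -> C.
# The missing edge A -> C encodes the conditional independence C ⊥ A | B,
# so the joint factorizes as P(a, b, c) = P(a) * P(b | a) * P(c | b).

p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # p_c_given_b[b][c]

def joint(a, b, c):
    """Joint probability read off the network's factorization."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# Any joint distribution admits some Bayesian-network factorization (e.g. the
# fully connected chain rule); the graph is informative precisely when it
# omits edges, i.e. asserts independences.
total = sum(joint(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3))
print(round(total, 10))  # a valid distribution sums to 1
```

The fully connected graph represents any distribution trivially, which is why "transformers are Bayesian networks" is only interesting if the paper identifies a graph with meaningful structure.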
I haven't taken the time to review the paper, but if the claim stands, it gives us another tool in the toolbox for understanding transformers.