I believe there have been studies showing that the attention mechanism can implement gradient-like updates for one-shot learning (i.e., based on what you tell the model you want in the input, it uses attention to effectively 'update' the weights of the linear layers and 'learn' new information). This seems to take that one step further and use attention for the weight estimation itself. The key insight here is that by adding more tokens to the weight-estimation calculation, you get more degrees of freedom.
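To make that first claim concrete, here's a toy sketch (my construction, not from the paper being discussed) of the standard "linear attention = one gradient step" argument: if you set the keys to the in-context inputs, the values to their labels, and the query to the test input, a softmax-free attention readout matches the prediction of a linear model after one gradient-descent step from zero init on a least-squares loss. The learning rate `eta` plays the role of the attention scale.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.1                      # learning rate, doubles as the attention scale
X = rng.normal(size=(8, 4))    # in-context inputs x_i
y = rng.normal(size=8)         # in-context targets y_i
x_q = rng.normal(size=4)       # query (test) input

# One GD step on L(w) = 0.5 * sum_i (w @ x_i - y_i)^2 starting from w = 0:
w = eta * (y @ X)              # w' = eta * sum_i y_i * x_i
pred_gd = w @ x_q

# Linear (softmax-free) attention with keys K = X, values V = y, query Q = x_q:
pred_attn = eta * (y @ (X @ x_q))   # eta * sum_i y_i * (x_i . x_q)

# The two predictions agree exactly, by associativity of the products.
assert np.isclose(pred_gd, pred_attn)
```

So "attention updating the weights" isn't just a metaphor; for the linear case the two computations are literally the same expression regrouped.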
Total aside, but imagining how many levels of functions are present in the calculation of each activation here, and thinking about how regular old differentiation and gradient descent actually work to train these nested parameters, is truly amazing, in my opinion.