
I believe there have been studies showing that the attention mechanism allows estimation of gradients for one-shot learning (i.e., based on what you tell the model you want in the input, it uses attention to 'update' the weights of the linear layers and 'learn' new information). This seems to take that one step further and use attention for the weight estimation itself. The key insight here is that by adding more tokens to the weight estimation calculation, you get more degrees of freedom.
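To give that idea a shape in code: here is a minimal NumPy sketch of a token-parameterized linear layer, where a fixed weight matrix is replaced by attention over learnable "parameter tokens". Every name and shape below is an illustrative assumption, not the actual implementation; the point is only that capacity scales with the number of parameter tokens while the input/output dimensions stay fixed.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    rng = np.random.default_rng(0)
    d_in, d_out, n_param_tokens = 8, 4, 16  # made-up sizes for illustration

    # Learnable "parameter tokens": keys decide which slots an input activates,
    # values carry the actual weight content.
    P_keys = rng.normal(size=(n_param_tokens, d_in))
    P_vals = rng.normal(size=(n_param_tokens, d_out))

    def token_parameterized_linear(x):
        # In place of y = W @ x with a fixed W, attend over parameter tokens.
        scores = P_keys @ x / np.sqrt(d_in)   # (n_param_tokens,)
        attn = softmax(scores)                # input-dependent mixing weights
        return attn @ P_vals                  # (d_out,)

    x = rng.normal(size=d_in)
    y = token_parameterized_linear(x)
    print(y.shape)  # (4,)

Notice that growing n_param_tokens adds degrees of freedom without touching d_in or d_out, which is (on my reading) the "more tokens, more degrees of freedom" point.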

Total aside, but imagining how many levels of functions are present in the calculation of each activation here, and thinking about how regular old differentiation and gradient descent actually work to train these nested parameters, is truly amazing, in my opinion.
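To make the aside concrete: a toy scalar version of that nesting, a stack of tanh layers with hand-written backprop, showing the chain rule threading a gradient down through every level of the composition. The layer count and weights are made up for illustration.

    import numpy as np

    def forward_backward(x, ws):
        # Forward: nested tanh layers, caching every activation.
        acts, a = [x], x
        for w in ws:
            a = np.tanh(w * a)
            acts.append(a)
        # Backward: the chain rule walks back down, one local derivative at a time.
        grads, upstream = [], 1.0
        for w, a_in, a_out in zip(reversed(ws), reversed(acts[:-1]), reversed(acts[1:])):
            local = 1.0 - a_out ** 2               # d tanh(z) / dz at this layer
            grads.append(upstream * local * a_in)  # d out / d w for this layer
            upstream *= local * w                  # push the gradient one level down
        return a, list(reversed(grads))

    out, grads = forward_backward(0.5, [0.9, -1.2, 0.7, 1.1])
    print(out, grads)

Real networks just do this at vastly larger scale, with matrices in place of the scalars.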



Yeah. This thing is "assembling a different transformer" on the spot for each token.

If one thinks about it for more than a moment, it's kind of incredible that it works.
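One way to see the "different transformer per token" point: even in plain attention, each token gets its own freshly computed mixing weights on every forward pass; in a token-parameterized model those data-dependent weights also stand in for the layer's parameters. A small NumPy illustration (shapes arbitrary):

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(1)
    T, d = 5, 8
    Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

    A = softmax(Q @ K.T / np.sqrt(d))  # (T, T): one row of mixing weights per token

    # Row t of A is, in effect, a linear map assembled on the spot for token t,
    # recomputed from the data on every forward pass.
    out = A @ V
    print(A[0].round(2))  # token 0's on-the-fly "weights" over the sequence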


I think the same about regular neural networks.



