Well, technically they can be 40 GB fp32 models, but converting a model trained in fp32 to fp16 is not trivial (trust me, we’re working on exactly that for a model right now). And remember that training requires much more memory than just the model parameters, because you also need to store a gradient for every parameter (and, depending on the optimizer, additional per-parameter state on top of that).
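As a rough back-of-envelope sketch of why training costs more than inference (assuming plain SGD here; an optimizer like Adam would add roughly two more full-size copies for its moment estimates):

```python
def training_memory_gb(num_params: int, bytes_per_param: int = 4) -> float:
    """Rough lower bound for fp32 training: weights + gradients.
    Ignores activations, optimizer state, and framework overhead."""
    params = num_params * bytes_per_param   # the weights themselves
    grads = num_params * bytes_per_param    # one gradient per weight
    return (params + grads) / 1e9

# A 40 GB fp32 model has ~10 billion parameters (4 bytes each),
# so even this lower bound doubles the footprint:
n = 10_000_000_000
print(f"inference: {n * 4 / 1e9:.0f} GB, training: {training_memory_gb(n):.0f} GB")
# → inference: 40 GB, training: 80 GB
```

In practice the real number is higher still, since activations and optimizer state usually dominate beyond this floor.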