I’m not very well versed in this domain, but I think it’s not going to be “VRAM” (GDDR) memory, but rather “unified memory”, which is essentially RAM (some flavour of DDR5 I assume). These two types of memory has vastly different bandwidth.
I’m pretty curious to see any benchmarks on inference on VRAM vs UM.
I’m using VRAM as shorthand for “memory which the AI chip can use” which I think is fairly common shorthand these days. For the spark is it unified, and has lower bandwidth than most any modern GPU. (About 300 GB/s which is comparable to an RTX 3060.)
So for an LLM inference is relatively slow because of that bandwidth, but you can load much bigger smarter models than you could on any consumer GPU.
I’m pretty curious to see any benchmarks on inference on VRAM vs UM.