
No, not in milliseconds if you have a longish context. Prefill is very compute-heavy compared to decoding.
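A back-of-the-envelope roofline calculation makes the asymmetry concrete. The sketch below uses assumed, illustrative numbers (a hypothetical 7B-parameter fp16 model on roughly A100-class hardware), not measurements, and ignores attention FLOPs and KV-cache traffic:

    # Roofline sketch: why prefill is compute-bound and decode is
    # memory-bandwidth-bound. All figures below are assumptions.
    PARAMS = 7e9          # hypothetical 7B-parameter model
    BYTES_PER_PARAM = 2   # fp16 weights
    PEAK_FLOPS = 1e15     # assumed ~A100-class fp16 throughput, FLOP/s
    PEAK_BW = 2e12        # assumed ~2 TB/s HBM bandwidth, B/s

    def step_time(n_tokens):
        # A transformer does ~2 FLOPs per parameter per token, but the
        # weights only have to be streamed from HBM once per forward
        # pass, so a big prefill batch amortizes the memory traffic
        # that dominates single-token decode steps.
        compute_time = 2 * PARAMS * n_tokens / PEAK_FLOPS
        memory_time = PARAMS * BYTES_PER_PARAM / PEAK_BW
        return max(compute_time, memory_time)  # the binding constraint

    print(step_time(1))     # one decode step: ~7 ms, memory-bound
    print(step_time(4096))  # 4096-token prefill: ~57 ms, compute-bound,
                            # i.e. ~14 us per token, ~500x cheaper

Per token, prefill comes out hundreds of times cheaper here because the weight traffic that dominates decode is paid once per pass rather than once per token.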


Depends how you're defining it. There can be a lot of context to ingest, so prefill is a lot of compute in absolute terms. But it's also much more memory-efficient, since the prompt tokens are processed in a batch: the weights are streamed once and amortized across tokens, which is why prefill tends to be compute-bound, and you can throw a lot of hardware at a compute-bound problem. Generation, on the other hand, can be significantly more expensive in wall-clock time: it's slower per token, and within a single request you can't batch across output tokens (short of speculative decoding with a draft model).
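On the draft-model point: speculative decoding is the standard trick for recovering some parallelism during generation. A cheap draft model proposes a few tokens serially, and the big model verifies all of them at once. Below is a minimal sketch of the greedy variant, with toy deterministic functions standing in for both models; everything here, including the proposal length K, is an illustrative assumption:

    # Greedy speculative decoding sketch. The "models" are toy
    # stand-ins; in a real system the K verification checks share a
    # single batched forward pass of the target model, which is where
    # the speedup comes from.
    VOCAB = 100
    K = 4  # tokens the draft proposes per round (assumed setting)

    def target_next(seq):
        # Toy stand-in for the big model's greedy next token.
        return (sum(seq) * 31 + 7) % VOCAB

    def draft_next(seq):
        # Toy stand-in for a small draft model: usually agrees with
        # the target, occasionally wrong.
        tok = target_next(seq)
        return (tok + 1) % VOCAB if len(seq) % 5 == 0 else tok

    def speculative_decode(prompt, n_tokens):
        seq = list(prompt)
        while len(seq) < len(prompt) + n_tokens:
            # Draft runs serially, but each step is cheap.
            proposals = []
            for _ in range(K):
                proposals.append(draft_next(seq + proposals))
            # Target verifies the proposals; accept the longest prefix
            # where its greedy choice matches the draft's.
            accepted = 0
            for i in range(K):
                if target_next(seq + proposals[:i]) == proposals[i]:
                    accepted += 1
                else:
                    break
            seq += proposals[:accepted]
            # The target's own next token comes for free from the
            # verification pass, whether or not a proposal failed.
            seq.append(target_next(seq))
        return seq[:len(prompt) + n_tokens]

    print(speculative_decode([1, 2, 3], 12))

Because accepted tokens always match the target's greedy choice, the output is identical to what the target model alone would produce; the draft only changes how many serial passes of the big model are needed.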



