If you want guidance acceleration speedups (and token healing), you have to use an open model locally right now, though we are working on a remote server solution as well. I expect APIs will adopt more support for fine-grained control over time, but for now commercial endpoints like OpenAI's are supported only through multiple API calls.
We manage the KV-cache in a session-based way that lets the LLM take a single forward pass through the whole program, generating only the tokens it actually needs.
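As a toy illustration of the idea (this is a sketch, not the actual guidance implementation, and `ToyModelSession` is a hypothetical class), template tokens are fed into the cache without any sampling, and the model is only asked to generate at the slots the program leaves open:

```python
# Toy sketch of session-based KV-cache reuse (illustration only,
# not the real guidance internals). Fixed template tokens are
# "force-fed" into the cache; sampling happens only at open slots.

class ToyModelSession:
    def __init__(self):
        self.kv_cache = []          # one entry per processed token

    def feed(self, tokens):
        """Append fixed template tokens to the cache; no sampling."""
        for t in tokens:
            self.kv_cache.append(("cached", t))

    def generate(self, n):
        """Sample only the tokens the program actually needs."""
        out = []
        for i in range(n):
            t = f"<gen{i}>"         # stand-in for a sampled token
            self.kv_cache.append(("generated", t))
            out.append(t)
        return out

session = ToyModelSession()
session.feed(["The", "capital", "of", "France", "is"])
answer = session.generate(1)        # only 1 token is sampled here
session.feed(["."])                 # template continues from the same cache
print(len(session.kv_cache))        # cache grows once; no re-prefill
```

Because every call extends the same cache, the fixed parts of the program never trigger a fresh prefill pass, which is where the acceleration comes from.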