
Semi-related, but is there a standard way to run this (or other models from Hugging Face) in a Docker container and interact with it through a web API? ChatGPT tells me to write my own FastAPI wrapper, which should work, but is there no pre-made solution for this?


Ollama has built-in support [1] for GGUF models on Hugging Face, and exposes an OpenAI-compatible HTTP endpoint [2].

You can also just test it out using the CLI:

ollama run hf.co/unsloth/SmolLM2-1.7B-Instruct-GGUF:F16

1. https://huggingface.co/docs/hub/ollama

2. https://github.com/ollama/ollama?tab=readme-ov-file#start-ol...
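
Since you asked about Docker: here's a rough sketch of the same setup containerized, going by the Ollama README (image name and flags come from there; add --gpus=all if you have an NVIDIA GPU). Not tested end to end, so treat it as a starting point:

  # start the Ollama server in a container
  $ docker run -d -v ollama:/root/.ollama -p 11434:11434 \
      --name ollama ollama/ollama

  # pull the model from Hugging Face inside the container
  $ docker exec -it ollama ollama pull hf.co/unsloth/SmolLM2-1.7B-Instruct-GGUF:F16

  # query it through the OpenAI-compatible endpoint
  $ curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "hf.co/unsloth/SmolLM2-1.7B-Instruct-GGUF:F16",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'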


Thanks, Ollama is exactly what I was looking for.


In shell 1:

  $ docker run --runtime nvidia --gpus all \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
      -p 8000:8000 \
      --ipc=host \
      vllm/vllm-openai:latest \
      --model mistralai/Mistral-7B-v0.1
In shell 2:

  $ curl http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "mistralai/Mistral-7B-v0.1",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
      }'


Love vLLM for how fast it is while also being easy to host.


Hugging Face TGI (Text Generation Inference) supports many models and more than one API:

https://huggingface.co/docs/text-generation-inference/en/ins...
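
A minimal sketch of running it, adapted from the TGI quickstart (the model id and image tag here are just examples, not something I've tested):

  $ docker run --gpus all --shm-size 1g -p 8080:80 \
      -v $PWD/data:/data \
      ghcr.io/huggingface/text-generation-inference:latest \
      --model-id HuggingFaceTB/SmolLM2-1.7B-Instruct

  $ curl http://localhost:8080/generate \
      -H "Content-Type: application/json" \
      -d '{"inputs": "San Francisco is a", "parameters": {"max_new_tokens": 20}}'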


llama.cpp in a Docker container (search for the GGUF version of your model)
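
Untested sketch, assuming the server image from the llama.cpp docs and a GGUF file already downloaded into ./models (the filename is a placeholder):

  $ docker run -v "$PWD/models":/models -p 8000:8000 \
      ghcr.io/ggerganov/llama.cpp:server \
      -m /models/SmolLM2-1.7B-Instruct-F16.gguf \
      --host 0.0.0.0 --port 8000

The server also exposes an OpenAI-compatible /v1/chat/completions endpoint, so the same kind of curl calls as above should work against it.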



