
Semi-related, but is there a standard way to run this (or other models from Hugging Face) in a Docker container and interact with it through a web API? ChatGPT tells me to write my own FastAPI wrapper, which should work, but is there no pre-made solution for this?


Ollama has built-in support [1] for GGUF models on Hugging Face, and exposes an OpenAI-compatible HTTP endpoint [2].

You can also just test it out using the CLI:

ollama run hf.co/unsloth/SmolLM2-1.7B-Instruct-GGUF:F16

1. https://huggingface.co/docs/hub/ollama

2. https://github.com/ollama/ollama?tab=readme-ov-file#start-ol...
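
Since you asked about Docker: here's a rough sketch of the same setup containerized, going by the Ollama README (image name and flags come from there; add --gpus=all if you have an NVIDIA GPU). Not tested end to end, so treat it as a starting point:

  # start the Ollama server in a container
  $ docker run -d -v ollama:/root/.ollama -p 11434:11434 \
      --name ollama ollama/ollama

  # pull the model from Hugging Face inside the container
  $ docker exec -it ollama ollama pull hf.co/unsloth/SmolLM2-1.7B-Instruct-GGUF:F16

  # query it through the OpenAI-compatible endpoint
  $ curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "hf.co/unsloth/SmolLM2-1.7B-Instruct-GGUF:F16",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'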


Thanks, Ollama is exactly what I was looking for.


In shell 1:

  $ docker run --runtime nvidia --gpus all \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
      -p 8000:8000 \
      --ipc=host \
      vllm/vllm-openai:latest \
      --model mistralai/Mistral-7B-v0.1
In shell 2:

  $ curl http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "mistralai/Mistral-7B-v0.1",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
      }'


Love vLLM for how fast it is while also being easy to host.


Hugging Face TGI (Text Generation Inference) supports many models and more than one API:

https://huggingface.co/docs/text-generation-inference/en/ins...
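
A minimal sketch of running it, adapted from the TGI quickstart (the model id and image tag here are just examples, not something I've tested):

  $ docker run --gpus all --shm-size 1g -p 8080:80 \
      -v $PWD/data:/data \
      ghcr.io/huggingface/text-generation-inference:latest \
      --model-id HuggingFaceTB/SmolLM2-1.7B-Instruct

  $ curl http://localhost:8080/generate \
      -H "Content-Type: application/json" \
      -d '{"inputs": "San Francisco is a", "parameters": {"max_new_tokens": 20}}'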


llama.cpp in a Docker container (search for the GGUF version of your model)
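
Untested sketch, assuming the server image from the llama.cpp docs and a GGUF file already downloaded into ./models (the filename is a placeholder):

  $ docker run -v "$PWD/models":/models -p 8000:8000 \
      ghcr.io/ggerganov/llama.cpp:server \
      -m /models/SmolLM2-1.7B-Instruct-F16.gguf \
      --host 0.0.0.0 --port 8000

The server also exposes an OpenAI-compatible /v1/chat/completions endpoint, so the same kind of curl calls as above should work against it.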



