We released a Cosmos-Reason2-2B W4A16 + FlashHead build optimized for Jetson devices. FlashHead is a drop-in replacement for the LM head that increases token-generation throughput without sacrificing reasoning quality, and it composes with other techniques like quantization.
NVIDIA released Cosmos-Reason2 last month, targeting physical AI workloads (video reasoning, robotics planning, event detection), with official support for DGX Spark, H100, GB200 and Jetson AGX Thor.
We quantized the 2B model to W4A16 and optimized it further to run across the full Jetson lineup, including the most constrained device, the Orin Nano Super (8 GB).
We're interested in feedback from others deploying VLMs on Jetson, especially around serving stacks (vLLM vs. TensorRT-LLM vs. other approaches) and practical bottlenecks!
Try it with vLLM's OpenAI-compatible server:
ssh <your-orin>
docker run --rm -it \
  --network host \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN_HERE> \
  embedl/vllm:latest-jetson-orin-flashhead \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead" \
    --gpu-memory-utilization 0.75 \
    --trust-remote-code
curl localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead","messages":[{"role":"user","content":"Hi"}]}'

Jetson video inference benchmark (tokens/s, batch size 1, 12 frames, 1280×720):
Device      FP16   W4A16   FlashHead
Orin Nano   OOM    43.7    53.5
AGX Orin    39.6   74.4    92.2
AGX Thor    56.2   88.3    128.2

Model: https://huggingface.co/embedl/Cosmos-Reason2-2B-W4A16-Edge2-...
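For video reasoning (rather than the plain-text "Hi" above), the OpenAI-compatible endpoint accepts multi-image chat messages, so a clip can be sent as one content part per frame. A minimal sketch of building such a request in Python, assuming the server from the docker command above is listening on localhost:8000 (the helper names here are our own, not part of vLLM):

```python
import base64
import json

MODEL = "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead"


def frame_to_data_url(jpeg_bytes: bytes) -> str:
    """Encode one JPEG frame as a base64 data URL, the inline-image
    form accepted by OpenAI-style image_url content parts."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{b64}"


def build_video_request(frames: list[bytes], question: str) -> dict:
    """Build a /v1/chat/completions payload with one image_url
    content part per frame, followed by the text question."""
    content = [
        {"type": "image_url", "image_url": {"url": frame_to_data_url(f)}}
        for f in frames
    ]
    content.append({"type": "text", "text": question})
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 256,
    }


if __name__ == "__main__":
    # 12 placeholder frames, matching the benchmark setting above;
    # in practice these would be JPEG-encoded frames from your clip.
    fake_frames = [b"\xff\xd8\xff\xe0placeholder"] * 12
    payload = build_video_request(fake_frames, "What happened in this clip?")
    # POST json.dumps(payload) to http://localhost:8000/v1/chat/completions
    print(len(payload["messages"][0]["content"]))  # 12 frames + 1 text part
```

Frame count and resolution are the main levers on latency here, so it is worth downsampling the clip client-side before encoding.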
We're Embedl, a research startup from Gothenburg, Sweden, and the team behind FlashHead. Let us know what other models you'd like to see it applied to.