We released a Cosmos-Reason2-2B W4A16 + FlashHead build optimized for Jetson devices. FlashHead is a drop-in replacement for the LM head that increases token-generation throughput without sacrificing reasoning quality, and it composes with other techniques like quantization.
NVIDIA released Cosmos-Reason2 last month, targeting physical AI workloads (video reasoning, robotics planning, event detection), with official support for DGX Spark, H100, GB200 and Jetson AGX Thor.
We quantized the 2B model to W4A16 and optimized it further to run across the full Jetson lineup, including the most constrained device, the Orin Nano Super (8 GB).
We're interested in feedback from others deploying VLMs on Jetson, especially around serving stacks (vLLM vs. TensorRT-LLM vs. other approaches) and practical bottlenecks!
Try it with vLLM's OpenAI-compatible server:
ssh <your-orin>
docker run --rm -it \
  --network host \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN_HERE> \
  embedl/vllm:latest-jetson-orin-flashhead \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead" \
    --gpu-memory-utilization 0.75 \
    --trust-remote-code
curl localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead","messages":[{"role":"user","content":"Hi"}]}'

Jetson video inference benchmark (tokens/s, batch size 1, 12 frames, 1280×720):
Device      FP16   W4A16   FlashHead
Orin Nano   OOM    43.7    53.5
AGX Orin    39.6   74.4    92.2
AGX Thor    56.2   88.3    128.2

Model: https://huggingface.co/embedl/Cosmos-Reason2-2B-W4A16-Edge2-...
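For video reasoning (rather than the plain-text "Hi" above), the OpenAI-compatible endpoint accepts multi-image chat messages, so a clip can be sent as one content part per frame. A minimal sketch of building such a request in Python, assuming the server from the docker command above is listening on localhost:8000 (the helper names here are our own, not part of vLLM):

```python
import base64
import json

MODEL = "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead"


def frame_to_data_url(jpeg_bytes: bytes) -> str:
    """Encode one JPEG frame as a base64 data URL, the inline-image
    form accepted by OpenAI-style image_url content parts."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{b64}"


def build_video_request(frames: list[bytes], question: str) -> dict:
    """Build a /v1/chat/completions payload with one image_url
    content part per frame, followed by the text question."""
    content = [
        {"type": "image_url", "image_url": {"url": frame_to_data_url(f)}}
        for f in frames
    ]
    content.append({"type": "text", "text": question})
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 256,
    }


if __name__ == "__main__":
    # 12 placeholder frames, matching the benchmark setting above;
    # in practice these would be JPEG-encoded frames from your clip.
    fake_frames = [b"\xff\xd8\xff\xe0placeholder"] * 12
    payload = build_video_request(fake_frames, "What happened in this clip?")
    # POST json.dumps(payload) to http://localhost:8000/v1/chat/completions
    print(len(payload["messages"][0]["content"]))  # 12 frames + 1 text part
```

Frame count and resolution are the main levers on latency here, so it is worth downsampling the clip client-side before encoding.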
We're Embedl, a research startup from Gothenburg, Sweden, and the team behind FlashHead. Let us know what other models you'd like to see it applied to.