Hacker News
Running Cosmos-Reason2-2B on 8GB Jetson Orin Nano (huggingface.co)
1 point by Embedl-Wilhelm 67 days ago | 2 comments


NVIDIA released Cosmos-Reason2 last month, targeting physical AI workloads (video reasoning, robotics planning, event detection), with official support for DGX Spark, H100, GB200 and Jetson AGX Thor.

We quantized the 2B model to W4A16 and optimized it further to run across the full Jetson lineup, including the most memory-constrained device, the Jetson Orin Nano Super (8 GB).
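A back-of-envelope estimate shows why 4-bit weights matter on an 8 GB board that also hosts the OS and KV cache. The numbers below are illustrative approximations (dense weights only, no activations, cache, or runtime overhead), not measured figures:

```python
def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB for a dense model."""
    return n_params * bits_per_weight / 8 / 2**30

n = 2e9  # ~2B parameters

fp16 = weight_memory_gib(n, 16)  # full-precision weights
w4 = weight_memory_gib(n, 4)     # W4A16: 4-bit weights, 16-bit activations

print(f"FP16 weights:  {fp16:.1f} GiB")  # ~3.7 GiB
print(f"W4A16 weights: {w4:.1f} GiB")    # ~0.9 GiB
```

On a unified-memory Jetson, the ~2.8 GiB saved on weights is what leaves room for the vision encoder, KV cache, and the rest of the system.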

Model, setup instructions, and benchmarks: https://huggingface.co/embedl/Cosmos-Reason2-2B-W4A16

Interested in feedback from others deploying VLMs on Jetson, especially around serving stacks (vLLM vs TensorRT-LLM vs other approaches) and practical bottlenecks!


Quickstart (vLLM Jetson container):

--gpu-memory-utilization and --max-num-seqs should be adapted to your system's specifications (i.e., available RAM).

    docker run --rm -it \
      --network host \
      --shm-size=8g \
      --ulimit memlock=-1 \
      --ulimit stack=67108864 \
      --runtime=nvidia \
      --name=vllm-serve \
      ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
      vllm serve "embedl/Cosmos-Reason2-2B-W4A16" \
        --max-model-len 8192 \
        --gpu-memory-utilization 0.75 \
        --max-num-seqs 2
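Once the container is up, vLLM exposes an OpenAI-compatible HTTP API (port 8000 by default). A minimal client sketch using only the Python standard library; the prompt is just an illustration, and the model name must match the one passed to vllm serve:

```python
import json
import urllib.request

# Build an OpenAI-style chat-completions payload for the served model.
payload = {
    "model": "embedl/Cosmos-Reason2-2B-W4A16",
    "messages": [
        {"role": "user",
         "content": "Describe what happens when a ball rolls off a table."}
    ],
    "max_tokens": 128,
}
body = json.dumps(payload).encode("utf-8")
print(json.dumps(payload, indent=2))

# Sending the request requires the quickstart container to be running locally:
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```

The same endpoint also accepts image/video content parts in the messages for the multimodal reasoning tasks the model targets.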



