Running Llama 2 70B on Hetzner GPU boxes - cost analysis
Hey folks, been testing LLM inference on Hetzner's RTX 6000 Ada boxes and wanted to share what I'm seeing.
Running vLLM with Llama 2 70B quantized to int8 gives me ~45 tokens/sec with batch size 8. At €3.29/hr per GPU box, that's roughly €0.073 per 1k tokens. Way cheaper than Azure OpenAI's GPT-4 tier, but you're managing your own scaling.
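A back-of-envelope version of that calc, for anyone checking. Note the utilization factor here is a hypothetical knob, not a measured number: at 100% utilization the raw compute cost comes out around €0.020/1k, so the €0.073 figure presumably bakes in idle hours (see the cold-start point below).

```python
# Back-of-envelope cost per 1k tokens. "utilization" is a hypothetical knob:
# at 1.0 the raw compute cost is ~EUR 0.020/1k, so the EUR 0.073 figure above
# presumably bakes in idle hours.
hourly_rate_eur = 3.29       # Hetzner RTX 6000 Ada box
throughput_tok_s = 45        # observed throughput at batch size 8
utilization = 0.28           # hypothetical duty cycle, not a measured number

tokens_per_hour = throughput_tok_s * 3600 * utilization
cost_per_1k_eur = hourly_rate_eur / (tokens_per_hour / 1000)
print(f"~EUR {cost_per_1k_eur:.3f} per 1k tokens")  # ~0.073 at 28% utilization
```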
Where it gets messy: cold start overhead. Keep an instance hot and you're paying for idle time; let it spin down and the next request eats the cold-start penalty. We've been load-testing with Nginx as a simple load balancer in front of 3 boxes (rough config below), but we're looking at vLLM's built-in batching queue next.
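For reference, the Nginx side is nothing fancy; a minimal sketch of the kind of config we're running, with hypothetical upstream IPs:

```nginx
# Minimal load balancer sketch; upstream IPs are hypothetical placeholders.
upstream vllm_pool {
    least_conn;                   # route to the replica with fewest open requests
    server 10.0.0.2:8000;
    server 10.0.0.3:8000;
    server 10.0.0.4:8000;
}

server {
    listen 80;
    location / {
        proxy_pass http://vllm_pool;
        proxy_read_timeout 300s;  # long generations need a generous timeout
    }
}
```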
Anyone else running inference workloads at scale? Curious if you're seeing better margins elsewhere or if you've solved the cold-start problem elegantly.
Also: Hetzner's support ghosted me once on a CUDA driver issue, so heads up there.
Did you benchmark against the older RTX 6000 generations? The Ada boxes are solid, but I've been seeing better per-token costs on the older stock when demand is lower. Also, have you tried vLLM's continuous batching with a request queue instead of a fixed batch size of 8? Could squeeze more throughput out of the same hardware. Check their docs on scheduling: https://docs.vllm.ai/ The cold start tax is real though; we mitigated it by keeping a background job polling the endpoint every 30s to prevent shutdown (sketch below). Costs an extra €0.15/hr but kills the latency variance.
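The poller is about ten lines. A minimal sketch, with a hypothetical endpoint URL and assuming your idle-shutdown automation counts any HTTP hit as activity:

```python
# keep_alive.py - minimal sketch of the 30s keep-alive poll. The URL is
# hypothetical; point it at whatever health endpoint your deployment exposes.
import time
import urllib.request

ENDPOINT = "http://10.0.0.2:8000/health"  # hypothetical vLLM health endpoint

while True:
    try:
        # Any successful response counts as activity and keeps the box "busy".
        urllib.request.urlopen(ENDPOINT, timeout=5).close()
    except OSError as exc:
        print(f"keep-alive failed: {exc}")  # box may be mid-restart; keep trying
    time.sleep(30)
```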
Good point! I've only tested the Ada boxes so far, but I'll check out the older-gen pricing; didn't realize there could be a gap depending on demand. And yeah, continuous batching was next on my list anyway, so thanks for the pointer to the scheduling docs. Cheers!
The cold start problem is real. Have you looked at using spot instances to eat the idle cost? We've been running a pool of 2-3 warm replicas on spot plus one on-demand for redundancy, which cuts our effective rate down to ~€0.045/1k tokens when amortized over a month. The trick is having request queuing that can handle the 30-60s drain time when spots get yanked (rough sketch below). Also worth checking vLLM's continuous batching docs; enabling that bumped our throughput by ~20% without extra GPU overhead: https://docs.vllm.ai/
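Heavily simplified sketch of the drain-tolerant queuing idea. Replica addresses and model name are hypothetical, and the real thing is a shared queue fed by many clients rather than a per-call loop:

```python
# Sketch of drain-tolerant request handling: if a spot replica gets yanked
# mid-drain, the request is re-queued and retried against whatever is still up.
# Addresses and model name are hypothetical placeholders.
import json
import queue
import time
import urllib.request

REPLICAS = ["http://10.0.0.2:8000", "http://10.0.0.3:8000", "http://10.0.0.4:8000"]

def complete(prompt: str, max_wait: float = 90.0) -> str:
    """Try each replica in turn; re-queue on failure until max_wait elapses."""
    deadline = time.monotonic() + max_wait
    pending = queue.SimpleQueue()
    pending.put(prompt)
    while time.monotonic() < deadline:
        p = pending.get()
        for base in REPLICAS:
            body = json.dumps({"model": "llama-2-70b", "prompt": p,
                               "max_tokens": 256}).encode()
            req = urllib.request.Request(base + "/v1/completions", data=body,
                                         headers={"Content-Type": "application/json"})
            try:
                with urllib.request.urlopen(req, timeout=30) as resp:
                    return json.load(resp)["choices"][0]["text"]
            except OSError:
                continue  # replica draining or gone; fall through to the next
        pending.put(p)    # everything busy or draining: back off, then retry
        time.sleep(5)
    raise TimeoutError("no replica accepted the request before the deadline")
```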
Thanks! The spot instance pool strategy makes a lot of sense for our use case. Definitely going to test warm replicas to handle those cold start costs.
Update: tested the warm replica pool with 2 spot + 1 on-demand. Cold start overhead dropped from 12s to 2s, and total monthly costs down ~35%. vLLM's continuous batching made a huge difference too.