Running Llama 2 70B on Hetzner GPU boxes - cost analysis
Hey folks, been testing LLM inference on Hetzner's RTX 6000 Ada boxes and wanted to share what I'm seeing.
Running vLLM with Llama 2 70B quantized to int8 gives me ~45 tokens/sec with batch size 8. At €3.29/hr per GPU box, that's roughly €0.073 per 1k tokens. Way cheaper than Azure OpenAI's GPT-4 tier, but you're managing your own scaling.
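A back-of-envelope version of that calc, for anyone checking. Note the utilization factor here is a hypothetical knob, not a measured number: at 100% utilization the raw compute cost comes out around €0.020/1k, so the €0.073 figure presumably bakes in idle hours (see the cold-start point below).

```python
# Back-of-envelope cost per 1k tokens. "utilization" is a hypothetical knob:
# at 1.0 the raw compute cost is ~EUR 0.020/1k, so the EUR 0.073 figure above
# presumably bakes in idle hours.
hourly_rate_eur = 3.29       # Hetzner RTX 6000 Ada box
throughput_tok_s = 45        # observed throughput at batch size 8
utilization = 0.28           # hypothetical duty cycle, not a measured number

tokens_per_hour = throughput_tok_s * 3600 * utilization
cost_per_1k_eur = hourly_rate_eur / (tokens_per_hour / 1000)
print(f"~EUR {cost_per_1k_eur:.3f} per 1k tokens")  # ~0.073 at 28% utilization
```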
Where it gets messy: cold start overhead. Keep an instance hot and you're paying for idle time; let it spin down and the next request eats the cold-start penalty. We've been load-testing with Nginx as a simple load balancer in front of 3 boxes (rough config below), but we're looking at vLLM's built-in batching queue next.
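For reference, the Nginx side is nothing fancy; a minimal sketch of the kind of config we're running, with hypothetical upstream IPs:

```nginx
# Minimal load balancer sketch; upstream IPs are hypothetical placeholders.
upstream vllm_pool {
    least_conn;                   # route to the replica with fewest open requests
    server 10.0.0.2:8000;
    server 10.0.0.3:8000;
    server 10.0.0.4:8000;
}

server {
    listen 80;
    location / {
        proxy_pass http://vllm_pool;
        proxy_read_timeout 300s;  # long generations need a generous timeout
    }
}
```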
Anyone else running inference workloads at scale? Curious if you're seeing better margins elsewhere or if you've solved the cold-start problem elegantly.
Also: Hetzner's support ghosted me once on a CUDA driver issue, so heads up there.
Did you benchmark against the older RTX 6000 generations? The Ada boxes are solid, but I've been seeing better per-token costs on the older stock when demand is lower. Also, have you tried vLLM's continuous batching with a request queue instead of a fixed batch size of 8? Could squeeze more throughput out of the same hardware. Check their docs on scheduling: https://docs.vllm.ai/ The cold start tax is real though; we mitigated it by keeping a background job polling the endpoint every 30s to prevent shutdown (sketch below). Costs an extra €0.15/hr but kills the latency variance.
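The poller is about ten lines. A minimal sketch, with a hypothetical endpoint URL and assuming your idle-shutdown automation counts any HTTP hit as activity:

```python
# keep_alive.py - minimal sketch of the 30s keep-alive poll. The URL is
# hypothetical; point it at whatever health endpoint your deployment exposes.
import time
import urllib.request

ENDPOINT = "http://10.0.0.2:8000/health"  # hypothetical vLLM health endpoint

while True:
    try:
        # Any successful response counts as activity and keeps the box "busy".
        urllib.request.urlopen(ENDPOINT, timeout=5).close()
    except OSError as exc:
        print(f"keep-alive failed: {exc}")  # box may be mid-restart; keep trying
    time.sleep(30)
```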
Good point! I've only tested the Ada boxes so far, but I'll check out the older-gen pricing; didn't realize there could be a gap depending on demand. And yeah, continuous batching was next on my list anyway, so thanks for the pointer to the scheduling docs. Cheers!
The cold start problem is real. Have you looked at using spot instances to eat the idle cost? We've been running a pool of 2-3 warm replicas on spot plus one on-demand for redundancy, which cuts our effective rate down to ~€0.045/1k tokens when amortized over a month. The trick is having request queuing that can handle the 30-60s drain time when spots get yanked (rough sketch below). Also worth checking vLLM's continuous batching docs; enabling that bumped our throughput by ~20% without extra GPU overhead: https://docs.vllm.ai/
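Heavily simplified sketch of the drain-tolerant queuing idea. Replica addresses and model name are hypothetical, and the real thing is a shared queue fed by many clients rather than a per-call loop:

```python
# Sketch of drain-tolerant request handling: if a spot replica gets yanked
# mid-drain, the request is re-queued and retried against whatever is still up.
# Addresses and model name are hypothetical placeholders.
import json
import queue
import time
import urllib.request

REPLICAS = ["http://10.0.0.2:8000", "http://10.0.0.3:8000", "http://10.0.0.4:8000"]

def complete(prompt: str, max_wait: float = 90.0) -> str:
    """Try each replica in turn; re-queue on failure until max_wait elapses."""
    deadline = time.monotonic() + max_wait
    pending = queue.SimpleQueue()
    pending.put(prompt)
    while time.monotonic() < deadline:
        p = pending.get()
        for base in REPLICAS:
            body = json.dumps({"model": "llama-2-70b", "prompt": p,
                               "max_tokens": 256}).encode()
            req = urllib.request.Request(base + "/v1/completions", data=body,
                                         headers={"Content-Type": "application/json"})
            try:
                with urllib.request.urlopen(req, timeout=30) as resp:
                    return json.load(resp)["choices"][0]["text"]
            except OSError:
                continue  # replica draining or gone; fall through to the next
        pending.put(p)    # everything busy or draining: back off, then retry
        time.sleep(5)
    raise TimeoutError("no replica accepted the request before the deadline")
```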
Thanks! The spot instance pool strategy makes a lot of sense for our use case. Definitely going to test warm replicas to handle those cold start costs.
Update: tested the warm replica pool with 2 spot + 1 on-demand. Cold start overhead dropped from 12s to 2s, and total monthly costs down ~35%. vLLM's continuous batching made a huge difference too.