RTX 4090 vs H100 for LLM fine-tuning—cost vs performance tradeoff?
I'm spinning up a small lab for fine-tuning open-source LLMs and trying to figure out the sweet spot between hardware cost and training throughput.
Current setup options:
- Option A: 4x RTX 4090 ($15K total), linked over PCIe (the 4090 doesn't support NVLink)
- Option B: 2x H100 ($40K total) with full NVLink support
For typical fine-tuning workloads (7B-13B models, batch size 8-16), I'm seeing benchmark estimates of:
- 4x RTX 4090: ~520 TFLOPs mixed precision, ~96GB VRAM total
- 2x H100: ~1,456 TFLOPs mixed precision, ~160GB VRAM total
But the H100s are roughly 2.7x the cost. I'm running this on a Vultr bare metal setup with redundant fiber, so power/cooling isn't a constraint.
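Quick sanity check on cost per unit of compute, using the quoted estimates above (peak numbers, not measured throughput):

```python
# Hardware cost per peak mixed-precision TFLOP, from the estimates
# quoted in the post (not vendor-verified benchmark results).

def dollars_per_tflop(price_usd: float, tflops: float) -> float:
    """Upfront hardware cost per peak TFLOP of mixed-precision compute."""
    return price_usd / tflops

print(f"4x RTX 4090: ${dollars_per_tflop(15_000, 520):.2f}/TFLOP")
print(f"2x H100:     ${dollars_per_tflop(40_000, 1_456):.2f}/TFLOP")
```

On paper the H100s actually come out slightly cheaper per peak TFLOP, so the real question is utilization and scaling efficiency, not raw price per FLOP.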
Has anyone done this comparison in production? Should I factor in:
- Memory bandwidth and interconnect differences (NVLink vs. PCIe bottlenecks)?
- Multi-GPU scaling efficiency above 4 GPUs?
- Actual wall-clock time for typical fine-tuning jobs?
I'd rather invest in 4x 4090s and scale horizontally later if needed, but want to make sure I'm not hitting a practical limit. Thoughts?
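For reference, here's the back-of-envelope wall-clock model I've been using. The 6 x params x tokens FLOPs approximation is a standard rule of thumb for training compute, but the MFU (model FLOPs utilization) and multi-GPU scaling factors below are guesses, not measurements:

```python
# Rough wall-clock estimate for a compute-bound fine-tuning run.
# Uses the common ~6 * params * tokens training-FLOPs approximation.
# mfu and scaling are assumed placeholder values -- measure your own jobs.

def finetune_hours(params_b: float, tokens_b: float, peak_tflops: float,
                   mfu: float = 0.35, scaling: float = 0.85) -> float:
    """Estimated wall-clock hours for one pass over the data."""
    total_flops = 6 * (params_b * 1e9) * (tokens_b * 1e9)
    effective_flops_per_s = peak_tflops * 1e12 * mfu * scaling
    return total_flops / effective_flops_per_s / 3600

# 7B model, 1B tokens, at each setup's quoted peak:
print(f"4x 4090: ~{finetune_hours(7, 1, 520):.1f} h")
print(f"2x H100: ~{finetune_hours(7, 1, 1_456):.1f} h")
```

With identical MFU and scaling assumptions the ratio just tracks peak TFLOPs (~2.8x), so the interesting part is whether those assumptions actually hold on 4x PCIe-connected 4090s.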
Edited at 25 Mar 2026, 19:35
Have you factored in power/cooling costs? 4x 4090s will pull ~1.4kW sustained, H100s closer to 700W total. Over 2 years that's a real delta depending on your datacenter rates. Also consider: the RTX 4090 has no NVLink at all, so peer-to-peer traffic goes over PCIe and you'll hit diminishing returns past 2-3 GPUs in parallel, whereas NVLinked H100s scale close to linearly. For 7B-13B workloads you might be fine with 2x 4090s + cloud overflow, honestly.
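To put a number on that power delta (the 1.4kW vs 0.7kW sustained-draw figures are from my estimate above; the electricity rate is an assumed example, plug in your own):

```python
# Rough 2-year energy cost delta between the two setups.
# Sustained draw: ~1.4 kW (4x 4090) vs ~0.7 kW (2x H100).
# The $/kWh rate is an assumed example -- substitute your actual rate.

HOURS_PER_YEAR = 24 * 365

def energy_cost_usd(kw: float, years: float, usd_per_kwh: float) -> float:
    """Electricity cost of a constant sustained draw over the period."""
    return kw * HOURS_PER_YEAR * years * usd_per_kwh

rate = 0.12  # $/kWh, assumed
delta = energy_cost_usd(1.4, 2, rate) - energy_cost_usd(0.7, 2, rate)
print(f"2-year power cost delta: ~${delta:,.0f}")
```

At typical colo rates that lands around $1.5K over 2 years, so power alone won't close a $25K hardware gap; it matters more the longer you run the boxes.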
Good point on power costs; I hadn't fully mapped out the 2-year TCO. At our colocation rates (~$0.12/kWh), the extra ~0.7kW sustained works out to roughly $1.5K over 2 years, which is modest next to the $25K hardware gap, but taken together with the PCIe scaling ceiling on the 4090s it tips me toward the H100s even at the higher upfront price. Thanks for the reality check!