MLOps pipeline for fine-tuning LLMs at scale
Hey team, I'm setting up a production MLOps workflow for fine-tuning open-source LLMs (Llama 2, Mistral) on customer datasets. Currently using:
- GPU infra: Mix of H100s on Vultr (8x per region) + spot instances on AWS
- Orchestration: K8s 1.29 with Kubeflow for job scheduling
- Training: PyTorch Lightning with DVC for experiment tracking
- Storage: S3 for datasets (~500GB per customer), model versioning via HuggingFace Hub
Main pain points:
- Cost optimization - GPU utilization only 65% during off-peak hours
- Multi-tenancy isolation - Worried about data leakage between customer jobs
- Monitoring - Need better observability on training metrics beyond basic logs
Has anyone deployed similar setups at scale? Considering Weights & Biases for experiment tracking but open to alternatives. Also exploring spot instance auto-fallback to reduce costs by ~40%.
TIA!
For GPU utilization, have you considered vLLM for inference serving? It batches requests automatically and can push utilization way higher than raw PyTorch. Pairs nicely with Kubeflow—just spin it up as a deployment. Also, spot instances are great but H100 preemption on AWS is brutal; consider reserved capacity for your baseline training jobs and spot only for experiments. We switched to this split and cut costs by ~35%.
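To put rough numbers on that reserved-baseline + spot-for-experiments split, here's the back-of-envelope model we used (all rates and prices below are made-up placeholders, not real Vultr/AWS quotes):

```python
# Rough cost model for a reserved-baseline + spot-for-experiments split.
# RESERVED_RATE and SPOT_RATE are illustrative discounts vs on-demand.
RESERVED_RATE = 0.6   # assumed fraction of on-demand price for reserved capacity
SPOT_RATE = 0.35      # assumed fraction of on-demand price for spot

def blended_cost(on_demand_hourly: float, total_gpu_hours: float,
                 baseline_share: float) -> float:
    """Cost when baseline training runs on reserved capacity and the
    remainder (experiments) runs on spot instances."""
    baseline_hours = total_gpu_hours * baseline_share
    experiment_hours = total_gpu_hours - baseline_hours
    return (baseline_hours * on_demand_hourly * RESERVED_RATE
            + experiment_hours * on_demand_hourly * SPOT_RATE)

# Example: 1000 GPU-hours/month at a notional $10/hr on-demand price,
# with 60% of load being steady baseline training.
all_on_demand = 1000 * 10.0
split = blended_cost(10.0, 1000, baseline_share=0.6)
savings = 1 - split / all_on_demand
print(f"blended: ${split:,.0f}, savings vs all on-demand: {savings:.0%}")
```

Your actual savings depend entirely on the baseline/experiment split and your negotiated rates, which is why we landed at ~35% rather than the headline spot discount.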
Good point on vLLM! We're actually using it for inference already, but I was mainly focused on the training side. The batching during fine-tuning is where we're leaving money on the table. Might try increasing batch sizes across regions and see if that helps. Cheers for the reminder!
Have you looked into gradient checkpointing + mixed precision training? Sounds like you're already on PyTorch Lightning, so mixed precision is just trainer = Trainer(precision='16-mixed'). One gotcha: Lightning's enable_checkpointing flag controls saving model checkpoints, not gradient checkpointing; the latter is enabled on the model itself (e.g. model.gradient_checkpointing_enable() for HF transformers models). Together they can drop your per-job costs ~35-40% without sacrificing convergence. Also, consider using DeepSpeed's ZeRO optimizer if you're doing multi-GPU training—it'll crush your memory footprint and let you pack more into those H100s. Check https://pytorch.org/docs/ for the mixed precision tuning details.
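Back-of-envelope on why the two together buy so much headroom (the constants below are rough illustrative assumptions, not measurements of any particular model):

```python
import math

# Rough activation-memory estimator for a transformer fine-tune.
# The "12 * hidden activations per token per layer" factor is a
# common rule of thumb, used here purely for illustration.
BYTES_FP32, BYTES_FP16 = 4, 2

def activation_gb(layers: int, tokens_per_batch: int, hidden: int,
                  mixed_precision: bool, grad_checkpointing: bool) -> float:
    """Approximate activation memory in GB. With gradient checkpointing
    only ~sqrt(layers) layers' activations stay live; the rest are
    recomputed during the backward pass."""
    bytes_per = BYTES_FP16 if mixed_precision else BYTES_FP32
    live_layers = math.sqrt(layers) if grad_checkpointing else layers
    return live_layers * tokens_per_batch * hidden * 12 * bytes_per / 1e9

# 32 layers, 8 sequences of 4096 tokens, hidden size 4096 (Llama-2-7B-ish)
base = activation_gb(32, 4096 * 8, 4096, False, False)
opt = activation_gb(32, 4096 * 8, 4096, True, True)
print(f"fp32 no ckpt: {base:.0f} GB, 16-mixed + ckpt: {opt:.0f} GB")
```

The order-of-magnitude drop in activation memory is what lets you raise per-GPU batch size instead of buying more H100 hours.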
For the batching issue during training—have you tried DeepSpeed's ZeRO optimizer? It'll help with both memory efficiency and throughput, especially on those H100s. Can squeeze way more samples per GPU without the gradient checkpointing overhead. Pair it with DDP and you're golden.
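For concreteness, a minimal ZeRO stage-2 config sketch (the values are starting points to tune, not recommendations; on the Lightning side you can feed a dict like this to DeepSpeedStrategy(config=...)):

```python
import json

# Minimal DeepSpeed ZeRO stage-2 config sketch for multi-GPU fine-tuning.
# Batch sizes here are placeholders to tune per model and sequence length.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},           # H100s handle bf16 natively
    "zero_optimization": {
        "stage": 2,                       # shard optimizer state + gradients
        "overlap_comm": True,             # overlap reduction with backward pass
        "contiguous_gradients": True,     # reduce memory fragmentation
    },
}

# Serialize it the way DeepSpeed's CLI would read it from a JSON file.
print(json.dumps(ds_config, indent=2))
```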
Been running DeepSpeed ZeRO on similar setups; it definitely cuts costs in half, but watch your Kubeflow job timeout configs or it'll fail mid-training.
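The timeout knobs live under runPolicy on the PyTorchJob; the sketch below assumes the Kubeflow training operator's RunPolicy fields, so double-check against your operator version:

```python
# Sketch of the PyTorchJob runPolicy fields that govern timeouts and
# retries. Values are illustrative; size the deadline to your longest run.
pytorchjob_patch = {
    "spec": {
        "runPolicy": {
            # Long fine-tunes need a generous deadline, or the operator
            # kills the job mid-training (the failure mode described above).
            "activeDeadlineSeconds": 72 * 3600,
            "backoffLimit": 3,                 # retries on worker failure
            "cleanPodPolicy": "Running",       # clean up only running pods
        }
    }
}
```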