MLOps pipeline for fine-tuning LLMs at scale
Hey team, I'm setting up a production MLOps workflow for fine-tuning open-source LLMs (Llama 2, Mistral) on customer datasets. Currently using:
- GPU infra: Mix of H100s on Vultr (8x per region) + spot instances on AWS
- Orchestration: K8s 1.29 with Kubeflow for job scheduling
- Training: PyTorch Lightning with DVC for experiment tracking
- Storage: S3 for datasets (~500GB per customer), model versioning via HuggingFace Hub
Main pain points:
- Cost optimization - GPU utilization only 65% during off-peak hours
- Multi-tenancy isolation - Worried about data leakage between customer jobs
- Monitoring - Need better observability on training metrics beyond basic logs
Has anyone deployed similar setups at scale? Considering Weights & Biases for experiment tracking but open to alternatives. Also exploring spot instance auto-fallback to reduce costs by ~40%.
TIA!
For GPU utilization, have you considered vLLM for inference serving? It batches requests automatically and can push utilization way higher than raw PyTorch. Pairs nicely with Kubeflow—just spin it up as a deployment. Also, spot instances are great but H100 preemption on AWS is brutal; consider reserved capacity for your baseline training jobs and spot only for experiments. We switched to this split and cut costs by ~35%.
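To put rough numbers on that reserved-baseline + spot-for-experiments split, here's the back-of-envelope model we used (all rates and prices below are made-up placeholders, not real Vultr/AWS quotes):

```python
# Rough cost model for a reserved-baseline + spot-for-experiments split.
# RESERVED_RATE and SPOT_RATE are illustrative discounts vs on-demand.
RESERVED_RATE = 0.6   # assumed fraction of on-demand price for reserved capacity
SPOT_RATE = 0.35      # assumed fraction of on-demand price for spot

def blended_cost(on_demand_hourly: float, total_gpu_hours: float,
                 baseline_share: float) -> float:
    """Cost when baseline training runs on reserved capacity and the
    remainder (experiments) runs on spot instances."""
    baseline_hours = total_gpu_hours * baseline_share
    experiment_hours = total_gpu_hours - baseline_hours
    return (baseline_hours * on_demand_hourly * RESERVED_RATE
            + experiment_hours * on_demand_hourly * SPOT_RATE)

# Example: 1000 GPU-hours/month at a notional $10/hr on-demand price,
# with 60% of load being steady baseline training.
all_on_demand = 1000 * 10.0
split = blended_cost(10.0, 1000, baseline_share=0.6)
savings = 1 - split / all_on_demand
print(f"blended: ${split:,.0f}, savings vs all on-demand: {savings:.0%}")
```

Your actual savings depend entirely on the baseline/experiment split and your negotiated rates, which is why we landed at ~35% rather than the headline spot discount.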
Good point on vLLM! We're actually using it for inference already, but I was mainly focused on the training side. The batching during fine-tuning is where we're leaving money on the table. Might try increasing batch sizes across regions and see if that helps. Cheers for the reminder!
Have you looked into gradient checkpointing + mixed precision training? Sounds like you're already on PyTorch Lightning, so mixed precision is just trainer = Trainer(precision='16-mixed'). One gotcha: Lightning's enable_checkpointing flag controls saving model checkpoints, not gradient checkpointing; the latter is enabled on the model itself (e.g. model.gradient_checkpointing_enable() for HF transformers models). Together they can drop your per-job costs ~35-40% without sacrificing convergence. Also, consider using DeepSpeed's ZeRO optimizer if you're doing multi-GPU training—it'll crush your memory footprint and let you pack more into those H100s. Check https://pytorch.org/docs/ for the mixed precision tuning details.
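Back-of-envelope on why the two together buy so much headroom (the constants below are rough illustrative assumptions, not measurements of any particular model):

```python
import math

# Rough activation-memory estimator for a transformer fine-tune.
# The "12 * hidden activations per token per layer" factor is a
# common rule of thumb, used here purely for illustration.
BYTES_FP32, BYTES_FP16 = 4, 2

def activation_gb(layers: int, tokens_per_batch: int, hidden: int,
                  mixed_precision: bool, grad_checkpointing: bool) -> float:
    """Approximate activation memory in GB. With gradient checkpointing
    only ~sqrt(layers) layers' activations stay live; the rest are
    recomputed during the backward pass."""
    bytes_per = BYTES_FP16 if mixed_precision else BYTES_FP32
    live_layers = math.sqrt(layers) if grad_checkpointing else layers
    return live_layers * tokens_per_batch * hidden * 12 * bytes_per / 1e9

# 32 layers, 8 sequences of 4096 tokens, hidden size 4096 (Llama-2-7B-ish)
base = activation_gb(32, 4096 * 8, 4096, False, False)
opt = activation_gb(32, 4096 * 8, 4096, True, True)
print(f"fp32 no ckpt: {base:.0f} GB, 16-mixed + ckpt: {opt:.0f} GB")
```

The order-of-magnitude drop in activation memory is what lets you raise per-GPU batch size instead of buying more H100 hours.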
For the batching issue during training—have you tried DeepSpeed's ZeRO optimizer? It'll help with both memory efficiency and throughput, especially on those H100s. Can squeeze way more samples per GPU without the gradient checkpointing overhead. Pair it with DDP and you're golden.
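For concreteness, a minimal ZeRO stage-2 config sketch (the values are starting points to tune, not recommendations; on the Lightning side you can feed a dict like this to DeepSpeedStrategy(config=...)):

```python
import json

# Minimal DeepSpeed ZeRO stage-2 config sketch for multi-GPU fine-tuning.
# Batch sizes here are placeholders to tune per model and sequence length.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},           # H100s handle bf16 natively
    "zero_optimization": {
        "stage": 2,                       # shard optimizer state + gradients
        "overlap_comm": True,             # overlap reduction with backward pass
        "contiguous_gradients": True,     # reduce memory fragmentation
    },
}

# Serialize it the way DeepSpeed's CLI would read it from a JSON file.
print(json.dumps(ds_config, indent=2))
```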
Been running DeepSpeed ZeRO on similar setups; it definitely cuts costs in half, but watch your Kubeflow job timeout configs or it'll fail mid-training.
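The timeout knobs live under runPolicy on the PyTorchJob; the sketch below assumes the Kubeflow training operator's RunPolicy fields, so double-check against your operator version:

```python
# Sketch of the PyTorchJob runPolicy fields that govern timeouts and
# retries. Values are illustrative; size the deadline to your longest run.
pytorchjob_patch = {
    "spec": {
        "runPolicy": {
            # Long fine-tunes need a generous deadline, or the operator
            # kills the job mid-training (the failure mode described above).
            "activeDeadlineSeconds": 72 * 3600,
            "backoffLimit": 3,                 # retries on worker failure
            "cleanPodPolicy": "Running",       # clean up only running pods
        }
    }
}
```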