Insight

When a private LLM actually makes sense

Running a private LLM used to be a compromise. The models were worse, the tooling was rough, and the only reason to do it was compliance. That changed. Open-weight models now perform within 2% of closed APIs on most benchmarks. Kubernetes runs 66% of production inference workloads. The serving engines are real. Private deployment is a genuine engineering choice, not a fallback.

A private large language model (LLM) is one that runs on infrastructure you control instead of calling someone else's API — whether that infrastructure is your own hardware or an environment managed for you.

Most workloads are fine on a third-party API. The ones that aren't tend to share a few traits: the data can't leave, the model needs to reach internal systems, or a regulator needs to see how it runs. If two of those apply, you're looking at a self-hosting candidate.

We mapped 29 of those workloads across six patterns. Below is the decision framework.

01 · Workload Atlas

Which AI workloads fit self-hosted inference

Not every AI workload needs its own infrastructure. These six patterns are the ones where it tends to matter — because of the data involved, the systems the model needs to reach, or what you need to prove after the fact.

02 · Why Now

Why private deployment is viable now

Open-source and open-weight models closed the benchmark gap from 8% to 1.7% in a single year. For many workloads, an open-weight model running inside your boundary now performs close enough to a frontier API that capability is no longer the deciding factor.

82% of container users now run Kubernetes in production. 66% use it for generative AI inference. The infrastructure for running inference, as distinct from model training, is mature. Self-hosting conversations now sound less like model experiments and more like platform work: GPU scheduling, serving capacity, rollout strategy, observability, operational discipline.

The shift in numbers

What changed across capability, infrastructure, and adoption

8% → 1.7%

benchmark gap reduction

Open-weight releases improved fast enough that proprietary stacks no longer hold a prohibitive edge on those benchmarks.

82%

run Kubernetes in production

Up from 66% in 2023—Kubernetes is now the default production substrate for container users, not a pilot.

66%

use Kubernetes for genAI inference

Inference runs on the same control planes teams already operate—rollout, capacity, and observability read as platform work, not a greenfield experiment.

2

providers on average

60% had switched LLM vendor in the last six months—multi-provider setups are normal, not an edge case.

60%

cite accuracy / hallucinations

Engineering teams rank wrong answers ahead of slow or expensive ones—latency and cost trail at 23% each in the same survey.

60%

use RAG / vector databases

Fine-tuning stayed in the single digits—most teams reach for retrieval and context before retraining.

Governance caught up too. The EU AI Act entered into force on 1 August 2024. GPAI obligations started applying 2 August 2025. The general application date is 2 August 2026. Combined with GDPR transfer constraints, running on-premise LLM inference inside a controlled environment is no longer just an engineering preference. For regulated workloads, it is becoming a requirement.

Regulation

EU AI Act timeline

The timeline is concrete enough to change deployment decisions.

Aug 1, 2024

EU AI Act enters into force

Feb 2, 2025

Rules on prohibited practices and AI literacy start applying

Aug 2, 2025

GPAI obligations start applying

Aug 2, 2026

General application date

The AI Act is an infrastructure decision

The AI Act splits organizations into providers and deployers. Providers of high-risk AI systems face conformity assessment, CE marking, technical documentation, and 10-year record-keeping. Deployers must log, monitor, and maintain human oversight — real work, but a different order of magnitude.

Which role you hold depends on what you do with the model. Run an open-weight model on your own infrastructure without materially changing it, and you stay on the deployer side. The Commission's GPAI guidance sets the boundary: you become a provider only when modification exceeds one-third of the original training compute. Standard customization sits well below that.

Customization and role boundary

How common customization techniques relate to provider status under the AI Act.
Technique | Triggers provider status? | Why
RAG | No | The model stays untouched.
Prompt engineering | No | It shapes the question, not the model.
Quantization | No | Compression is not retraining.
LoRA / lightweight fine-tuning | Unlikely | Typically well below the one-third compute threshold.
Heavy fine-tuning | Possibly | The point where the threshold is worth checking carefully.

RAG-first with optional lightweight fine-tuning keeps most teams on the deployer side.
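To make the one-third threshold concrete, here is a rough back-of-the-envelope comparison between a LoRA fine-tune and pretraining compute. All numbers are illustrative assumptions (a hypothetical 70B-parameter base model pretrained on 15T tokens), not published figures for any specific model, and the `6 * N * D` rule is the standard approximation, not an exact accounting.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute via the common 6 * N * D rule of thumb."""
    return 6 * params * tokens

# Assumed base model: 70B parameters pretrained on 15T tokens.
base = training_flops(70e9, 15e12)

# Assumed LoRA run: adapters train ~1% of the parameters on 1B tokens.
lora = training_flops(70e9 * 0.01, 1e9)

# Provider status triggers at one-third of the original training compute.
threshold = base / 3

print(f"LoRA compute is {lora / threshold:.1e} of the threshold")
```

Under these assumptions the LoRA run lands roughly six orders of magnitude below the threshold, which is why lightweight fine-tuning is marked "Unlikely" above while heavy fine-tuning is worth checking.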

Four obligations that need stack control

The Act never says "self-host." But four deployer obligations point there.

You must keep logs

Retain automatically generated logs for at least six months. On your cluster, that is a pipeline you configure. On an API, you get what the provider exposes.

You must govern your data

Where the deployer controls input data, it must be relevant and representative for the intended purpose. Self-hosting gives you the full path: what enters, how it is processed, what comes out.

You must be able to stop it

People must be able to understand output, reject it, and stop the system. Easier to build when you control the deployment.

You must monitor it

Monitor operation, respond without undue delay to risks or incidents. Self-hosted: direct visibility. API: outside view only.

The pattern across all four: the Act asks for capabilities that map to infrastructure control. A contract is a legal promise. Operating the stack is an operational fact.

GDPR compounds it

When prompts contain personal data and reach a US-based API, GDPR Chapter V becomes an architecture constraint. The EU-US Data Privacy Framework survived its first court test in September 2025, but an appeal was brought in October and published in the Official Journal in December. The transfer basis is live but unsettled.

The logs the AI Act requires can themselves contain personal data — with their own lawful basis, retention policy, and access controls. Self-hosting on EU infrastructure removes the transfer layer entirely. Data that never leaves does not trigger it.

03 · Routing

Self-host, hybrid, or API

Not every workload maps cleanly. Some need frontier capability that only closed models provide today. Others combine sensitivity with burst demand. The routing logic is not absolute. It is a portfolio decision.

Self-host

Data or prompts must stay inside your boundary.

Need model to act inside systems of record / internal tools.

Traffic is steady/high-volume or latency-critical.

RAG over private docs is the main value.

Hybrid

Sensitive and repeatable paths deserve control.

Frontier-only capability or burst capacity still matters.

Provider switching is common enough that portability matters.

API

Fastest time-to-value matters most.

Frontier-only capability is the deciding factor.

Platform maturity is not there yet.
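The routing criteria above can be sketched as a small scoring function. The signal names and the two-signal rule (echoing "if two of those apply" from the introduction) are assumptions drawn from the bullets, not a published framework:

```python
from dataclasses import dataclass


@dataclass
class Workload:
    data_must_stay_inside: bool
    needs_internal_tools: bool
    steady_high_volume: bool
    rag_over_private_docs: bool
    needs_frontier_capability: bool
    bursty_traffic: bool


def route(w: Workload) -> str:
    """Portfolio-level routing sketch: self-host, hybrid, or API."""
    self_host_signals = sum([
        w.data_must_stay_inside,
        w.needs_internal_tools,
        w.steady_high_volume,
        w.rag_over_private_docs,
    ])
    # Two or more self-host signals, and no frontier dependency: self-host.
    if self_host_signals >= 2 and not w.needs_frontier_capability:
        return "self-host"
    # Some sensitivity plus frontier capability or burst demand: split the paths.
    if self_host_signals >= 1 and (w.needs_frontier_capability or w.bursty_traffic):
        return "hybrid"
    return "api"
```

A function like this is deliberately crude — the point of the section is that routing is a portfolio decision, and a scoring sketch makes the trade-offs explicit enough to argue about per workload.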

If your workload matches one of the patterns in the workload atlas, the next step is understanding which models fit it and what the infrastructure looks like. Explore the model pages to see what runs on self-hosted Kubernetes.

04 · Infrastructure

From inference engine to infrastructure

Self-hosted inference is a spectrum. At one end, a developer pulls a model and runs it on a single machine. At the other, a platform team operates multiple models across workloads on managed Kubernetes. Most of the distance between those two points is not about the model. It is about what sits around it.

The workloads here sit at the platform end. They need concurrent request handling, structured outputs, and observability on GPU hardware. That means a serving engine behind a layer that manages the traffic.

Ollama

Pull and run

1 user, 1 GPU

llama.cpp

Runs on anything

CPU, Metal, CUDA

vLLM

Handles real traffic

Concurrent users

SGLang

Structured output

Agent pipelines

KServe

Multi-model on K8s

Scale-to-zero, canary


The serving engine is one layer. Production workloads also need logging pipelines, deployment controls, monitoring, rollout strategy, and audit trails that satisfy regulatory requirements. Those are not engine features. They are infrastructure features. They come from the orchestration layer: Kubernetes.

Explore

Ollama, llama.cpp. One person, one machine. Good for evaluating models. Not where production workloads run.

Serve

vLLM, SGLang. Inference engines on GPU hardware that handle concurrent users and batch jobs. Where most production deployments begin.

Orchestrate

Kubernetes. Multiple models, deployment controls, staged rollouts, observability, access management. Where running models becomes operating infrastructure.

The step from serve to orchestrate is the infrastructure decision. The team picks the model and the engine. The platform underneath is what turns those choices into workloads you can operate, monitor, and answer for.
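One practical consequence of the serve layer: engines like vLLM and SGLang expose an OpenAI-compatible HTTP API, so moving a workload from a hosted provider to a self-hosted engine is largely a base-URL change. The in-cluster service address and model name below are illustrative assumptions:

```python
import json
from urllib.request import Request


def chat_request(base_url: str, model: str, prompt: str) -> Request:
    """Build a chat-completions request against an OpenAI-compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Point at a cluster-internal vLLM service instead of a public API
# (hypothetical service DNS name and model):
req = chat_request(
    "http://vllm.inference.svc.cluster.local:8000",
    "meta-llama/Llama-3.1-8B-Instruct",
    "ping",
)
```

Because the request shape is identical either way, the portability concern from the routing section — 60% of teams switched vendors within six months — is mostly handled at this layer.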

Next step

See the infrastructure behind private deployment

The workloads above run on GPU Kubernetes clusters. The AI infrastructure page covers GPU options, deployment architecture, and how we manage the serving layer.