Insight

When a private LLM actually makes sense

Running a private LLM used to be a compromise. The models were worse, the tooling was rough, and the only reason to do it was compliance. That changed. Open-weight models now perform within 2% of closed APIs on most benchmarks. Kubernetes runs 66% of production inference workloads. The serving engines are real. Private deployment is a genuine engineering choice, not a fallback.

A private large language model (LLM) is one that runs on infrastructure you control instead of calling someone else's API — whether that infrastructure is your own hardware or an environment managed for you.

Most workloads are fine on a third-party API. The ones that aren't tend to share a few traits: the data can't leave, the model needs to reach internal systems, or a regulator needs to see how it runs. If two of those apply, you're looking at a self-hosting candidate.

We mapped 29 of those workloads across six patterns. Below is the decision framework.

01 · Workload Atlas

Which AI workloads fit self-hosted inference

Not every AI workload needs its own infrastructure. These six patterns are the ones where it tends to matter — because of the data involved, the systems the model needs to reach, or what you need to prove after the fact.

02 · Why Now

Why private deployment is viable now

Open-source and open-weight models closed the benchmark gap from 8% to 1.7% in a single year. For many workloads, an open-weight model running inside your boundary now performs close enough to a frontier API that capability is no longer the deciding factor.

82% of container users now run Kubernetes in production. 66% use it for generative AI inference. The infrastructure for running inference, as distinct from model training, is mature. Self-hosting conversations now sound less like model experiments and more like platform work: GPU scheduling, serving capacity, rollout strategy, observability, operational discipline.

The shift in numbers

What changed across capability, infrastructure, and adoption

8% → 1.7%

benchmark gap reduction

Open-weight releases improved fast enough that proprietary stacks no longer hold a prohibitive edge on those benchmarks.

82%

run Kubernetes in production

Up from 66% in 2023—Kubernetes is now the default production substrate for container users, not a pilot.

66%

use Kubernetes for genAI inference

Inference runs on the same control planes teams already operate—rollout, capacity, and observability read as platform work, not a greenfield experiment.

2

providers on average

60% had switched LLM vendor in the last six months—multi-provider setups are normal, not an edge case.

60%

cite accuracy / hallucinations

Engineering teams rank wrong answers ahead of slow or expensive ones—latency and cost trail at 23% each in the same survey.

60%

use RAG / vector databases

Fine-tuning stayed in the single digits—most teams reach for retrieval and context before retraining.

Governance caught up too. The EU AI Act entered into force on 1 August 2024. GPAI obligations started applying 2 August 2025. The general application date is 2 August 2026. Combined with GDPR transfer constraints, running on-premise LLM inference inside a controlled environment is no longer just an engineering preference. For regulated workloads, it is becoming a requirement.

Regulation

EU AI Act timeline

The timeline is concrete enough to change deployment decisions.

Aug 1, 2024

EU AI Act enters into force

Feb 2, 2025

Rules on prohibited practices and AI literacy start applying

Aug 2, 2025

GPAI obligations start applying

Aug 2, 2026

General application date

The AI Act is an infrastructure decision

The AI Act splits organizations into providers and deployers. Providers of high-risk AI systems face conformity assessment, CE marking, technical documentation, and 10-year record-keeping. Deployers must log, monitor, and maintain human oversight — real work, but a different order of magnitude.

Which role you hold depends on what you do with the model. Run an open-weight model on your own infrastructure without materially changing it, and you stay on the deployer side. The Commission's GPAI guidance sets the boundary: you become a provider only when modification exceeds one-third of the original training compute. Standard customization sits well below that.

Customization and role boundary

How common customization techniques relate to provider status under the AI Act.
Technique | Triggers provider status? | Why
RAG | No | The model stays untouched.
Prompt engineering | No | It shapes the question, not the model.
Quantization | No | Compression is not retraining.
LoRA / lightweight fine-tuning | Unlikely | Typically well below the one-third compute threshold.
Heavy fine-tuning | Possibly | The point where the threshold is worth checking carefully.

RAG-first with optional lightweight fine-tuning keeps most teams on the deployer side.
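To make the one-third threshold concrete, here is a rough back-of-the-envelope comparison between a LoRA fine-tune and pretraining compute. All numbers are illustrative assumptions (a hypothetical 70B-parameter base model pretrained on 15T tokens), not published figures for any specific model, and the `6 * N * D` rule is the standard approximation, not an exact accounting.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute via the common 6 * N * D rule of thumb."""
    return 6 * params * tokens

# Assumed base model: 70B parameters pretrained on 15T tokens.
base = training_flops(70e9, 15e12)

# Assumed LoRA run: adapters train ~1% of the parameters on 1B tokens.
lora = training_flops(70e9 * 0.01, 1e9)

# Provider status triggers at one-third of the original training compute.
threshold = base / 3

print(f"LoRA compute is {lora / threshold:.1e} of the threshold")
```

Under these assumptions the LoRA run lands roughly six orders of magnitude below the threshold, which is why lightweight fine-tuning is marked "Unlikely" above while heavy fine-tuning is worth checking.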

Four obligations that need stack control

The Act never says "self-host." But four deployer obligations point there.

You must keep logs

Retain automatically generated logs for at least six months. On your cluster, that is a pipeline you configure. On an API, you get what the provider exposes.

You must govern your data

Where the deployer controls input data, it must be relevant and representative for the intended purpose. Self-hosting gives you the full path: what enters, how it is processed, what comes out.

You must be able to stop it

People must be able to understand output, reject it, and stop the system. Easier to build when you control the deployment.

You must monitor it

Monitor operation, respond without undue delay to risks or incidents. Self-hosted: direct visibility. API: outside view only.

The pattern across all four: the Act asks for capabilities that map to infrastructure control. A contract is a legal promise. Operating the stack is an operational fact.

GDPR compounds it

When prompts contain personal data and reach a US-based API, GDPR Chapter V becomes an architecture constraint. The EU-US Data Privacy Framework survived its first court test in September 2025, but an appeal was brought in October and published in the Official Journal in December. The transfer basis is live but unsettled.

The logs the AI Act requires can themselves contain personal data — with their own lawful basis, retention policy, and access controls. Self-hosting on EU infrastructure removes the transfer layer entirely. Data that never leaves does not trigger it.

03 · Routing

Self-host, hybrid, or API

Not every workload maps cleanly. Some need frontier capability that only closed models provide today. Others combine sensitivity with burst demand. The routing logic is not absolute. It is a portfolio decision.

Self-host

Data or prompts must stay inside your boundary.

Need model to act inside systems of record / internal tools.

Traffic is steady/high-volume or latency-critical.

RAG over private docs is the main value.

Hybrid

Sensitive and repeatable paths deserve control.

Frontier-only capability or burst capacity still matters.

Provider switching is common enough that portability matters.

API

Fastest time-to-value matters most.

Frontier-only capability is the deciding factor.

Platform maturity is not there yet.
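The routing criteria above can be sketched as a small scoring function. The signal names and the two-signal rule (echoing "if two of those apply" from the introduction) are assumptions drawn from the bullets, not a published framework:

```python
from dataclasses import dataclass


@dataclass
class Workload:
    data_must_stay_inside: bool
    needs_internal_tools: bool
    steady_high_volume: bool
    rag_over_private_docs: bool
    needs_frontier_capability: bool
    bursty_traffic: bool


def route(w: Workload) -> str:
    """Portfolio-level routing sketch: self-host, hybrid, or API."""
    self_host_signals = sum([
        w.data_must_stay_inside,
        w.needs_internal_tools,
        w.steady_high_volume,
        w.rag_over_private_docs,
    ])
    # Two or more self-host signals, and no frontier dependency: self-host.
    if self_host_signals >= 2 and not w.needs_frontier_capability:
        return "self-host"
    # Some sensitivity plus frontier capability or burst demand: split the paths.
    if self_host_signals >= 1 and (w.needs_frontier_capability or w.bursty_traffic):
        return "hybrid"
    return "api"
```

A function like this is deliberately crude — the point of the section is that routing is a portfolio decision, and a scoring sketch makes the trade-offs explicit enough to argue about per workload.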

If your workload matches one of the patterns in the workload atlas, the next step is understanding which models fit it and what the infrastructure looks like. Explore the model pages to see what runs on self-hosted Kubernetes.

04 · Infrastructure

From inference engine to infrastructure

Self-hosted inference is a spectrum. At one end, a developer pulls a model and runs it on a single machine. At the other, a platform team operates multiple models across workloads on managed Kubernetes. Most of the distance between those two points is not about the model. It is about what sits around it.

The workloads here sit at the platform end. They need concurrent request handling, structured outputs, and observability on GPU hardware. That means a serving engine behind a layer that manages the traffic.

Ollama

Pull and run

1 user, 1 GPU

llama.cpp

Runs on anything

CPU, Metal, CUDA

vLLM

Handles real traffic

Concurrent users

SGLang

Structured output

Agent pipelines

KServe

Multi-model on K8s

Scale-to-zero, canary


The serving engine is one layer. Production workloads also need logging pipelines, deployment controls, monitoring, rollout strategy, and audit trails that satisfy regulatory requirements. Those are not engine features. They are infrastructure features. They come from the orchestration layer: Kubernetes.

Explore

Ollama, llama.cpp. One person, one machine. Good for evaluating models. Not where production workloads run.

Serve

vLLM, SGLang. Inference engines on GPU hardware that handle concurrent users and batch jobs. Where most production deployments begin.

Orchestrate

Kubernetes. Multiple models, deployment controls, staged rollouts, observability, access management. Where running models becomes operating infrastructure.

The step from serve to orchestrate is the infrastructure decision. The team picks the model and the engine. The platform underneath is what turns those choices into workloads you can operate, monitor, and answer for.
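One practical consequence of the serve layer: engines like vLLM and SGLang expose an OpenAI-compatible HTTP API, so moving a workload from a hosted provider to a self-hosted engine is largely a base-URL change. The in-cluster service address and model name below are illustrative assumptions:

```python
import json
from urllib.request import Request


def chat_request(base_url: str, model: str, prompt: str) -> Request:
    """Build a chat-completions request against an OpenAI-compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Point at a cluster-internal vLLM service instead of a public API
# (hypothetical service DNS name and model):
req = chat_request(
    "http://vllm.inference.svc.cluster.local:8000",
    "meta-llama/Llama-3.1-8B-Instruct",
    "ping",
)
```

Because the request shape is identical either way, the portability concern from the routing section — 60% of teams switched vendors within six months — is mostly handled at this layer.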

Next step

See the infrastructure behind private deployment

The workloads above run on GPU Kubernetes clusters. The AI infrastructure page covers GPU options, deployment architecture, and how we manage the serving layer.