Insight
Running a private LLM used to be a compromise. The models were worse, the tooling was rough, and the only reason to do it was compliance. That changed. Open-weight models now perform within 2% of closed APIs on most benchmarks. Kubernetes runs 66% of production inference workloads. The serving engines are real. Private deployment is a genuine engineering choice, not a fallback.
A private large language model (LLM) is one that runs on infrastructure you control instead of
calling someone else's API. That can mean your own hardware or an environment managed for you; either counts.
Most workloads are fine on a third-party API. The ones that aren't tend to share a
few traits: the data can't leave, the model needs to reach internal systems, or a regulator
needs to see how it runs. If two of those apply, you're looking at a self-hosting candidate.
We mapped 29 of those workloads across six patterns. Below is the decision
framework.
01 · Workload Atlas
Not every AI workload needs its own infrastructure. These six patterns are the ones where it tends to matter — because of the data involved, the systems the model needs to reach, or what you need to prove after the fact.
Workload pattern · Sensitive data · 4 workloads
Workloads where files, conversations, or records are too sensitive to send through a third-party API.
Routing prompts through a third-party endpoint creates a boundary problem. Self-hosting moves the model inside the perimeter.
Multilingual audio analytics
Build a multilingual analytics layer across enterprise conversations.
Batch transcription pipelines
Transcribe sensitive audio files at scale into timestamped, system-ready text.
Structured outputs from meetings
Extract actions, questions, and mentions from meeting transcripts.
Document Q&A over contracts, policies, and case files
Query internal case and policy documents with grounded answers and linked sources.
Workload pattern · Systems of record · 5 workloads
The model creates, updates, or enriches records in the systems the business already runs on.
The model must run where the systems of record are reachable. API-hosted models cannot reach internal tools without proxy layers that often recreate the hosting problem they were meant to avoid.
Contract records from incoming legal documents
Extract key fields, clause flags, and review notes from incoming contracts into the contract system.
Write purchase inbox packets into ERP records
Read vendor emails, invoices, and purchase documents, then create payable drafts and exception records in ERP.
Write CRM case summaries from multilingual support calls
Turn multilingual support calls into CRM-ready summaries, labels, and next-step records.
Turn support into engineering tickets
Convert support threads, logs, and screenshots into structured engineering tickets with evidence and repro context.
Pre-populate case records from submitted materials
Read submitted forms and attachments, then prefill draft case records before staff review.
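Write-back workloads like these share a failure mode: the model emits a record that does not match the target schema, and the error lands in the system of record. A minimal sketch of a validation gate in front of the write, under assumptions of our own making (the field names and the required-schema shape are illustrative, not tied to any particular ERP):

```python
import json

# Hypothetical required schema for a payable draft record (illustrative).
REQUIRED_FIELDS = {"vendor": str, "invoice_number": str, "amount": float, "currency": str}

def validate_draft(raw: str) -> dict:
    """Parse model output and verify it matches the expected record shape
    before anything is written into the system of record."""
    record = json.loads(raw)
    errors = []
    for name, ftype in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], ftype):
            errors.append(f"wrong type for {name}: expected {ftype.__name__}")
    if errors:
        # Reject: the draft goes to an exception record, not into the ERP.
        raise ValueError("; ".join(errors))
    return record
```

A draft that fails this gate becomes an exception record for human review, which is exactly the split the purchase-inbox workload above describes.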
Workload pattern · Background batch jobs · 6 workloads
Repeatable jobs that run at scale in the background as part of day-to-day operations.
Steady, predictable, invisible to end users. Often the strongest self-hosting case because flat utilization makes dedicated infrastructure a natural fit. The operations team manages them like any other batch job.
Batch transcription pipelines
Transcribe sensitive audio files at scale into timestamped, system-ready text.
Nightly document classification and indexing
Classify and index newly received documents each night into searchable categories and queues.
Screenshot and evidence indexing
Turn screenshots, scans, and attached evidence into searchable records with tags and summaries.
Index screenshots by visible content
Index screenshots by the screens, errors, labels, and UI states they show, not by filename.
Dependencies into upgrade plans
Read dependency files and changelogs, then produce upgrade plans with risk notes and ordered tasks.
Detect deviations in recurring visual process outputs
Compare recurring inspection photos, generated reports, or production screenshots against a known baseline and flag meaningful changes.
Workload pattern · Scoped assistants · 5 workloads
Assistants that help employees inside approved tools, with defined context, actions, and limits.
Not open-ended chatbots. Scoped tools with approved sources and allowed actions. Self-hosting matters because they read internal systems, act inside internal tools, and their behavior needs to be auditable.
Internal assistant in chat workflow
Answer staff requests inside chat by reading approved sources and taking allowed actions in internal systems.
Meeting co-pilot with action extraction
Turn meeting audio into action items, decisions, owners, and follow-up records.
Service desk assistant with tool-backed lookup
Resolve internal service requests by checking approved systems and returning grounded answers or next steps.
Internal legal / compliance assistant over approved sources
Answer internal legal and compliance questions using only approved policies, contracts, and guidance.
Code and engineering copilots behind the boundary
Assist engineers with code, CI output, and internal tools without sending repo context outside the company boundary.
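The "allowed actions" constraint does not have to live inside the model: the assistant proposes an action, and a dispatcher outside the model executes it only if it is on an explicit allow-list. A sketch of that boundary, with made-up action names and handlers:

```python
from typing import Callable

# Explicit allow-list: the model can only trigger what is registered here.
# Action names and handlers are illustrative placeholders.
ALLOWED_ACTIONS: dict[str, Callable[[dict], str]] = {
    "lookup_ticket": lambda args: f"ticket {args['id']}: open",
    "reset_password": lambda args: f"reset link sent to {args['user']}",
}

def dispatch(action: str, args: dict) -> str:
    """Execute a model-proposed action only if it is explicitly allowed.
    Anything else is refused (and, in practice, logged for audit)."""
    handler = ALLOWED_ACTIONS.get(action)
    if handler is None:
        return f"refused: '{action}' is not an allowed action"
    return handler(args)
```

The refusal path is what makes the assistant auditable: every action the model ever took is one of a short, reviewable list.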
Workload pattern · Cross-source reconciliation · 5 workloads
The model checks multiple inputs against each other to find matches, gaps, or contradictions.
These workloads compare multiple internal sources against each other. They need access to several systems simultaneously and produce judgment calls someone will act on. Running them externally means sending paired internal documents outside the boundary, which compounds the sensitivity.
Compare invoice vs PO vs goods receipt
Check whether invoice values match the purchase order and goods receipt before payment is approved.
Compare submitted forms vs attached evidence
Check whether submitted forms are supported by the attached files, scans, or photos.
Compare contract clauses vs policy baseline
Flag contract clauses that differ from the company's standard legal or compliance baseline.
Compare meeting transcript vs CRM notes
Check whether CRM notes accurately reflect what was said on the call or in the meeting.
Flag contradictions between images and written records
Detect when site photos, damage photos, or inspection images do not match the written report submitted with them.
Workload pattern · Escalate to human review · 5 workloads
The model does the first pass, then routes uncertain or unusual cases to a human review queue.
When the model hits something outside its confidence threshold, it routes the case to a human queue with evidence attached. The opposite of autonomous agents — these are triage systems. Self-hosting matters because the escalation path connects to internal review tools, and the audit trail stays inside the company.
Flag low-confidence OCR fields for review
Send uncertain OCR-extracted fields to a review queue before they are written into records.
Route ambiguous support cases to the right queue
Detect unclear support cases and route them to human triage instead of forcing a bad classification.
Detect compliance-risk phrases in calls and escalate
Find risky promises, restricted claims, or policy-sensitive language in calls and route them for review.
Surface contradictory document evidence
Highlight conflicts across related documents before a reviewer opens the case.
Create reviewer-ready exception packets with linked evidence
Assemble the failed case, the relevant source material, and the reason for escalation into one review packet.
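The triage pattern reduces to one branch: compare the model's confidence against a threshold, then either commit the result or assemble an escalation packet for the human queue. A minimal sketch (the threshold value and packet fields are illustrative):

```python
from dataclasses import dataclass, field

CONFIDENCE_THRESHOLD = 0.85  # illustrative; tuned per workload in practice

@dataclass
class Escalation:
    """Reviewer-ready packet: the case, the reason, and linked evidence."""
    case_id: str
    reason: str
    evidence: list[str] = field(default_factory=list)

def triage(case_id: str, label: str, confidence: float, evidence: list[str]):
    """First-pass classification: commit confident results, route
    everything else to a human queue with evidence attached."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return ("commit", label)
    reason = f"confidence {confidence:.2f} below threshold"
    return ("review", Escalation(case_id, reason, evidence))
```

Self-hosting matters here because the `review` branch writes into internal review tools, so the whole escalation trail stays inside the boundary.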
02 · Why Now
Open-source and open-weight models closed the benchmark gap from 8% to 1.7% in a single year. For many workloads, an open-weight model running inside your boundary now performs close enough to a frontier API that capability is no longer the deciding factor. (Stanford HAI, AI Index 2025)
82% of container users now run Kubernetes in production. 66% use it for generative AI inference. The infrastructure for running inference, as distinct from model training, is mature. Self-hosting conversations now sound less like model experiments and more like platform work: GPU scheduling, serving capacity, rollout strategy, observability, operational discipline. (CNCF Annual Cloud Native Survey)
The shift in numbers
What changed across capability, infrastructure, and adoption
8% → 1.7%
benchmark gap reduction
Open-weight releases improved fast enough that proprietary stacks no longer hold a prohibitive edge on those benchmarks.
82%
run Kubernetes in production
Up from 66% in 2023—Kubernetes is now the default production substrate for container users, not a pilot.
66%
use Kubernetes for genAI inference
Inference runs on the same control planes teams already operate—rollout, capacity, and observability read as platform work, not a greenfield experiment.
2
providers on average
60% had switched LLM vendor in the last six months—multi-provider setups are normal, not an edge case.
60%
cite accuracy / hallucinations
Engineering teams rank wrong answers ahead of slow or expensive ones—latency and cost trail at 23% each in the same survey.
60%
use RAG / vector databases
Fine-tuning stayed in the single digits—most teams reach for retrieval and context before retraining.
Governance caught up too. The EU AI Act entered into force on 1 August 2024. GPAI obligations started applying 2 August 2025. The general application date is 2 August 2026. Combined with GDPR transfer constraints, running on-premise LLM inference inside a controlled environment is no longer just an engineering preference. For regulated workloads, it is becoming a requirement. (European Commission, AI Act Article 113; EDPB)
Regulation
The timeline is concrete enough to change deployment decisions.
Aug 1, 2024
EU AI Act enters into force
Feb 2, 2025
Rules on prohibited practices and AI literacy start applying
Aug 2, 2025
GPAI obligations start applying
Aug 2, 2026
General application date
The AI Act splits organizations into providers and deployers. Providers of high-risk AI systems face conformity assessment, CE marking, technical documentation, and 10-year record-keeping. Deployers must log, monitor, and maintain human oversight — real work, but a different order of magnitude. (Regulation (EU) 2024/1689)
Which role you hold depends on what you do with the model. Run an open-weight model on your own infrastructure without materially changing it, and you stay on the deployer side. The Commission's GPAI guidance sets the boundary: you become a provider only when modification exceeds one-third of the original training compute. Standard customization sits well below that. (European Commission, GPAI guidelines FAQ)
Customization and role boundary
| Technique | Triggers provider status? | Why |
|---|---|---|
| RAG | No | The model stays untouched. |
| Prompt engineering | No | It shapes the question, not the model. |
| Quantization | No | Compression is not retraining. |
| LoRA / lightweight fine-tuning | Unlikely | This is typically well below the one-third compute threshold. |
| Heavy fine-tuning | Possibly | This is the point where the threshold is worth checking carefully. |
RAG-first with optional lightweight fine-tuning keeps most teams on the deployer side.
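The one-third test from the Commission's guidance is simple arithmetic: compare the compute spent on your modification against the original training compute. A sketch under our own assumptions (the FLOP figures are made-up placeholders; real numbers come from the model card and your training logs):

```python
def likely_provider(original_training_flops: float, modification_flops: float) -> bool:
    """True if the modification exceeds one third of the original training
    compute -- the boundary where, per the Commission's GPAI guidance,
    a deployer may become a provider. A first-pass check, not legal advice."""
    return modification_flops > original_training_flops / 3

# Illustrative magnitudes: a LoRA run is typically orders of magnitude
# below the original pre-training compute.
assert likely_provider(1e24, 1e20) is False   # lightweight fine-tune
assert likely_provider(1e24, 5e23) is True    # heavy retraining
```

This is why the table above marks LoRA "Unlikely" and heavy fine-tuning "Possibly": the question is purely one of compute scale.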
The Act never says "self-host." But four deployer obligations point there.
You must keep logs
Retain automatically generated logs for at least six months. On your cluster, that is a pipeline you configure. On an API, you get what the provider exposes.
You must govern your data
Where the deployer controls input data, it must be relevant and representative for the intended purpose. Self-hosting gives you the full path: what enters, how it is processed, what comes out.
You must be able to stop it
People must be able to understand output, reject it, and stop the system. Easier to build when you control the deployment.
You must monitor it
Monitor operation, respond without undue delay to risks or incidents. Self-hosted: direct visibility. API: outside view only.
The pattern across all four: the Act asks for capabilities that map to infrastructure control. A contract is a legal promise. Operating the stack is an operational fact. (Regulation (EU) 2024/1689)
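The log obligation in particular maps directly to code you schedule. On your own cluster, "retain for at least six months" becomes a pruning job with the cutoff you chose. A minimal sketch (the entry schema with an ISO-8601 `ts` field is illustrative; a real pipeline works against your log store):

```python
from datetime import datetime, timedelta, timezone

# "At least six months": this is the floor, and many teams keep a margin.
RETENTION = timedelta(days=183)

def prune(entries: list[dict], now: datetime) -> list[dict]:
    """Keep only log entries inside the retention window.
    Each entry carries an ISO-8601 'ts' field (illustrative schema)."""
    cutoff = now - RETENTION
    return [e for e in entries if datetime.fromisoformat(e["ts"]) >= cutoff]
```

The point of the comparison in the text is that on your cluster this cutoff is yours to set and prove; on an API, you retain whatever the provider exposes.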
When prompts contain personal data and reach a US-based API, GDPR Chapter V becomes an architecture constraint. The EU-US Data Privacy Framework survived its first court test in September 2025, but an appeal was brought in October and published in the Official Journal in December. The transfer basis is live but unsettled. (CURIA, Latombe v Commission, T-553/23; Case C-703/25 P)
The logs the AI Act requires can themselves contain personal data — with their own lawful basis, retention policy, and access controls. Self-hosting on EU infrastructure removes the transfer layer entirely. Data that never leaves does not trigger it. (EDPB Opinion 28/2024)
03 · Routing
Not every workload maps cleanly. Some need frontier capability that only closed models provide today. Others combine sensitivity with burst demand. The routing logic is not absolute. It is a portfolio decision.
Self-host
Data or prompts must stay inside your boundary.
Need model to act inside systems of record / internal tools.
Traffic is steady/high-volume or latency-critical.
RAG over private docs is the main value.
Hybrid
Sensitive and repeatable paths deserve control.
Frontier-only capability or burst capacity still matters.
Provider switching is common enough that portability matters.
API
Fastest time-to-value matters most.
Frontier-only capability is the deciding factor.
Platform maturity is not there yet.
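The criteria above can be encoded as a first-pass routing helper. This is a sketch, not a substitute for the portfolio judgment the text describes: the flag names come straight from the bullets, and the tie-breaking is a deliberate simplification.

```python
def route(must_stay_inside: bool, acts_on_internal_systems: bool,
          steady_high_volume: bool, needs_frontier_capability: bool,
          fastest_time_to_value: bool) -> str:
    """First-pass routing per the headline signals above.
    Real decisions are made at the portfolio level."""
    wants_self_host = (must_stay_inside or acts_on_internal_systems
                       or steady_high_volume)
    if wants_self_host and needs_frontier_capability:
        # Sensitive/repeatable paths stay inside; frontier calls go out.
        return "hybrid"
    if wants_self_host:
        return "self-host"
    # No control requirement: speed or frontier capability points to an API.
    return "api"
```

A workload that must stay inside the boundary but still needs frontier capability lands on "hybrid", which is exactly the split the middle column describes.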
If your workload matches one of the patterns in the workload atlas, the next step is understanding which models fit it and what the infrastructure looks like. Explore the model pages to see what runs on self-hosted Kubernetes.
04 · Infrastructure
Self-hosted inference is a spectrum. At one end, a developer pulls a model and runs it on a single machine. At the other, a platform team operates multiple models across workloads on managed Kubernetes. Most of the distance between those two points is not about the model. It is about what sits around it.
The workloads here sit at the platform end. They need concurrent request handling, structured outputs, and observability on GPU hardware. That means a serving engine behind a layer that manages the traffic.
Ollama
Pull and run
1 user, 1 GPU
llama.cpp
Runs on anything
CPU, Metal, CUDA
vLLM
Handles real traffic
Concurrent users
SGLang
Structured output
Agent pipelines
KServe
Multi-model on K8s
Scale-to-zero, canary
Explore
Most production deployments start here
Orchestrate
The serving engine is one layer. Production workloads also need logging pipelines, deployment controls, monitoring, rollout strategy, and audit trails that satisfy regulatory requirements. Those are not engine features. They are infrastructure features. They come from the orchestration layer: Kubernetes.
Explore
Ollama, llama.cpp. One person, one machine. Good for evaluating models. Not where production workloads run.
Serve
vLLM, SGLang. Inference engines on GPU hardware that handle concurrent users and batch jobs. Where most production deployments begin.
Orchestrate
Kubernetes. Multiple models, deployment controls, staged rollouts, observability, access management. Where running models becomes operating infrastructure.
The step from serve to orchestrate is the infrastructure decision. The team picks the model and the engine. The platform underneath is what turns those choices into workloads you can operate, monitor, and answer for.
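One practical consequence of this stack: vLLM and SGLang expose an OpenAI-compatible HTTP API, so moving a workload from a hosted endpoint to a self-hosted engine is largely a base-URL change. A sketch of the request shape, where the host, port, and model name are placeholders for your own deployment:

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000/v1"   # placeholder: your serving endpoint
MODEL = "your-model-name"               # placeholder: whatever the server loaded

def build_chat_request(prompt: str) -> request.Request:
    """Build an OpenAI-compatible chat completion request against
    a self-hosted serving engine such as vLLM."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }
    return request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Only meaningful against a live server:
    with request.urlopen(build_chat_request("ping")) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the wire format matches, provider switching (the multi-provider behavior the survey data describes) and self-hosting end up exercising the same client code.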
Next step
The workloads above run on GPU Kubernetes clusters. The AI infrastructure page covers GPU options, deployment architecture, and how we manage the serving layer.
References
Stanford HAI
Stanford HAI, AI Index 2025
Open-weight benchmark-gap data and model capability trends.
CNCF
CNCF Annual Cloud Native Survey announcement
Kubernetes production adoption and generative AI inference workload usage.
Vercel
Vercel, State of AI
Multi-provider behavior, provider switching, technical challenges, and customization data.
European Commission
European Commission AI Act service desk, Article 113
AI Act timeline and staged applicability dates.
EDPB
European Data Protection Board, International data transfers
GDPR Chapter V framing for transfers of personal data outside the EEA.
EUR-Lex
Regulation (EU) 2024/1689 (AI Act)
Official AI Act text covering provider/deployer roles and high-risk system obligations.
European Commission
FAQs on the guidelines on obligations for providers of general-purpose AI models
Commission guidance on when downstream modification can make an actor a GPAI model provider.
EDPB
Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models
EDPB opinion on GDPR principles for AI models and downstream deployments.
CURIA
Press Release No 106/25: Latombe v Commission (T-553/23)
General Court press release of 3 September 2025 upholding the EU-US Data Privacy Framework decision.
EUR-Lex
Case C-703/25 P: Appeal brought by Philippe Latombe against the judgment in Case T-553/23
Official Journal notice confirming the pending appeal against the Latombe judgment.