fairlane.systems

KONG AI GATEWAY · TECH

Kong AI Gateway: Kubernetes-native API gateway with LLM plugins

Kong v3.8 extends the open-source API gateway with AI-Proxy, AI-Prompt-Guard, and semantic caching – self-host on Kubernetes or bare metal.

Researched & fact-checked by: · As of: 2026-05

What is Kong AI Gateway?

Kong (konghq.com) has been the widespread open-source API gateway in the cloud-native world since 2015. The engine is based on NGINX and OpenResty, written in Lua and Go, and is often used in Kubernetes clusters as an ingress controller. As of May 2026, the stable version is Kong Gateway 3.8 (LTS); the enterprise variant is called Kong Konnect.

The AI Gateway is not a separate component but a set of plugins that extend the existing gateway with LLM-specific capabilities. The most important plugins (as of May 2026): ai-proxy (routing to OpenAI, Anthropic, Azure, Cohere, Mistral, Llama, Hugging Face, AWS Bedrock), ai-prompt-guard (filter for prompt injection patterns and forbidden content), ai-prompt-decorator (system prompts and header injection before the upstream), ai-prompt-template (central template library), ai-request-transformer and ai-response-transformer (schema conversion between providers), ai-rate-limiting-advanced (token-based rate limit instead of request-based), ai-semantic-caching (cache hits based on embedding similarity rather than exact match), ai-azure-content-safety (integration with Azure content filter).

The licence is hybrid. Kong Gateway in the community variant is Apache-2.0; many AI plugins are included in the OSS variant (ai-proxy, ai-prompt-guard, ai-prompt-decorator). Semantic caching, ai-rate-limiting-advanced, and ai-azure-content-safety are enterprise plugins and require a Konnect licence. Licence costs are not public but range in the five-figure USD area per year depending on cluster size.

For fairlane.systems, Kong AI Gateway is interesting in two cases: first, for platform teams already running Kong as the standard gateway where LLM is just another backend route; second, for Kubernetes-native setups with high volume where the self-host LLM gateway must be an integral part of the platform.

Why it matters for platform teams

Three advantages make Kong AI Gateway attractive in certain setups. First: one gateway for everything. Anyone already running Kong for REST APIs, gRPC services, and WebSockets adds LLM routing as another route without building a second gateway layer. Authentication, rate limit, logging, tracing – everything runs through the same Kong pipeline. That cuts operational complexity considerably.

Second: Kubernetes-native integration. Kong is installed as a Kubernetes ingress controller or as an operator (Kong Gateway Operator); routes, plugins, consumers, virtual keys exist as Kubernetes CRDs. That fits GitOps practice: everything in Git, everything deployed via kubectl apply or Argo CD, everything versioned in Helm. For teams that manage their platform fully declaratively, that is the natural choice.

Third: hard performance. Kong is based on NGINX/OpenResty and is optimised for high load. At a Swiss client with 5,000 LLM calls per hour, the Kong overhead is under 5 ms per request – comparable to Bifrost and clearly better than Python-based gateways. At a platform with 100+ clients, that performance pays off.

Under the revised FADP, Kong AI Gateway is well positioned. Since the gateway is fully self-hostable (bare metal, Kubernetes, Docker), prompts and responses only leave your own infrastructure toward the configured upstream LLMs. With upstreams in EU/CH (Mistral La Plateforme, Azure OpenAI EU, local Ollama), the data flow chain is fully under your own control. Audit logs run via Kong plugins (file-log, http-log, syslog) into any backend – PostgreSQL, Loki, Elasticsearch.

How it works

The architecture follows the classic Kong pattern. A service definition points to the upstream LLM (e.g. https://api.mistral.ai), a route defines the incoming path (e.g. /v1/chat/completions), and plugins attach either to the service, to the route, or globally. For LLM routing, the ai-proxy plugin is activated on the service, transforming the OpenAI-schema request per provider.

Example: Kong should receive requests at /llm/v1/chat/completions and route them by header or path to Mistral, Anthropic, or local Ollama. The configuration via declarative YAML (decK or Kong Gateway Operator):

services: - name: mistral-eu url: https://api.mistral.ai routes: - name: chat-mistral paths: [/llm/mistral] plugins: - name: ai-proxy config: route_type: llm/v1/chat auth: header_name: Authorization header_value: Bearer ${MISTRAL_API_KEY} model: provider: mistral name: mistral-large-2411

Consumers receive credentials (Key-Auth, JWT, OAuth2) and are limited via the ai-rate-limiting-advanced plugin to token budgets. The plugin counts not requests but input and output tokens – fitting LLM billing.

The ai-prompt-guard plugin runs as a pre-filter. It checks the prompt against regex or word lists (e.g. for prompt injection patterns like "ignore previous instructions") and can block the request or write an audit log entry. The enterprise variant adds Azure Content Safety for LLM-based content classification.

Semantic caching (enterprise plugin) stores prompt embeddings in Redis or a vector database. For a new prompt, the embedding is computed and compared against the cache – at a similarity above a threshold (e.g. 0.95 cosine), the cached response is returned. This cuts cost and latency dramatically for recurring requests.

Kong AI setup in 5 steps

  1. 01Install Kong Gateway 3.8 as a Helm chart or Docker Compose, configure Postgres or DB-less mode.
  2. 02Define services and routes for each LLM provider (e.g. mistral-eu, anthropic, ollama-local) and activate the ai-proxy plugin.
  3. 03Create consumers, issue key-auth credentials, configure ai-rate-limiting-advanced on token budgets.
  4. 04Activate the ai-prompt-guard plugin globally, store regex lists for prompt injection patterns, write audit logs via http-log plugin into Loki.
  5. 05Switch applications: base_url=https://kong.intern/llm/v1, api_key=consumer key, model selection via route or header; test, then roll out.

When Kong AI Gateway fits

First, when Kong is already running as an API gateway. In that case the AI Gateway is a small extension, not a new layer. An existing Kong installation with 30 services gets three LLM services added – operations overhead stays minimal.

Second, for Kubernetes platform teams. Anyone working GitOps-style who wants to manage everything as CRDs benefits from the Kong Gateway Operator. Routes, plugins, consumers live as YAML in Git, get deployed via Argo CD or Flux, and are reproducible. A Python-based gateway like LiteLLM fits this workflow less well.

Third, in multi-tenant platforms with token-based billing. The ai-rate-limiting-advanced plugin counts tokens per consumer and can enforce hard budgets per client. Combined with the logging plugins, it forms a complete billing foundation per client.

Fourth, at high volume with latency demands. Kong with OpenResty/NGINX delivers consistently under 5 ms overhead per request, even at thousands of requests per second. Python gateways scale worse and need more hardware for the same load.

Fifth, when procurement demands open source. The community edition is Apache-2.0; many core AI plugins are included in the OSS variant. Enterprise plugins can be retrofitted later when semantic cache or advanced rate limits are needed.

When not to use

First, in small setups without Kubernetes. Kong can run as a Docker container or bare-metal service, but its sweet spot is the Kubernetes world. For an SME with a single VM and three applications, LiteLLM is easier to operate.

Second, when the team lacks NGINX/Lua knowledge. Writing custom plugins or debugging existing ones requires knowledge of OpenResty/Lua or Go (for Kong PDK Go). Anyone at home in Python stacks makes faster progress with LiteLLM or Helicone.

Third, when prompt versioning and A-B testing are central. Kong AI Gateway has the ai-prompt-template plugin but no full-blown prompt repository like Portkey or Langfuse. Anyone maintaining prompts as versioned artefacts with eval sets should run Portkey or Langfuse in parallel.

Fourth, when semantic cache is mandatory but the budget does not cover an enterprise licence. The ai-semantic-caching plugin is enterprise-only – missing in the OSS variant. Building a custom semantic cache is feasible but effort; LiteLLM with Redis cache or Portkey Cloud are alternatives.

Fifth, when multi-LLM model routing by answer quality is requested (Martian style). Kong routes to configured models without classifying content. Anyone wanting to pick the best model per request automatically needs a specialised router.

Trade-offs

STRENGTHS

  • Kubernetes-native with CRDs, Helm chart, and operator – fits GitOps workflows
  • Very low latency overhead via the NGINX/OpenResty base (under 5 ms typical)
  • Unified gateway stack: REST, gRPC, and LLM run through one pipeline
  • OSS plugins (Apache-2.0) for routing, prompt guard, and schema conversion included

WEAKNESSES

  • Steep learning curve in Lua/OpenResty for custom plugins or debug sessions
  • Semantic cache and advanced rate limits only in the paid enterprise variant
  • No full-blown prompt repository with versioning and A-B tests
  • Overkill for small single-VM setups without a Kubernetes stack

FAQ

Which AI plugins are OSS and which are enterprise?

OSS (Apache-2.0): ai-proxy, ai-prompt-guard, ai-prompt-decorator, ai-prompt-template, ai-request-transformer, ai-response-transformer. Enterprise (Konnect licence): ai-rate-limiting-advanced, ai-semantic-caching, ai-azure-content-safety, ai-aws-guardrails. The OSS plugins cover routing, basic filters, and schema conversion – usually enough for mandates without budget.

How high is the latency overhead?

Kong itself sits below 5 ms per request, even under load. With ai-prompt-guard and ai-rate-limiting-advanced active, the overhead rises to 5-10 ms. With semantic cache, it depends on the vector backend – Redis with RediSearch delivers in 3-8 ms, an external Qdrant instance in 10-20 ms. Overall Kong remains one of the fastest gateway options in the LLM space.

Does Kong AI Gateway work with local Ollama?

Yes. In the ai-proxy plugin you set provider: openai (because Ollama exposes an OpenAI-compatible API) and upstream_url to http://ollama:11434/v1. That lets you route a local Llama 3.3 70B or Mistral 7B behind Kong with the same authentication and logging setup as for cloud providers. A typical Swiss configuration mixes mistral-eu (cloud) plus local Ollama (PII-sensitive requests) behind a single Kong route.

Do I need Kong Konnect or is the community edition enough?

For pure LLM routing, prompt guard, and audit logging, the community edition is enough. Konnect adds multi-cluster management, a configuration UI, semantic cache, and advanced rate limits. For a single production installation with 1-3 LLM providers and under 50,000 calls/day, the community variant is sufficient; at multiple clusters or high volume with cache demand, Konnect becomes interesting.

Related topics

LITELLM · TECHLiteLLM: one gateway for 100+ LLM providers behind a single APILLM GATEWAYS · COMPARISONLLM gateways compared: 10 options for routing, audit, and cost controlMULTI-LLM GATEWAY · SERVICEMulti-LLM Gateway: eight providers, one entry point, compliance routingROUTING · AI CONCEPTMulti-LLM routing: which model when, for how muchDOCKER · TECH STACKDocker orchestration for SMEs: docker-compose without Kubernetes overkillAUDIT TRAIL · AI CONCEPTAI audit trail design: what to log so an AI answer stays audit-readySELF-HOSTED VS. CLOUD · AI CONCEPTSelf-hosted vs. cloud LLM: a decision framework for SMEs and fiduciaries

Sources

  1. Kong AI Gateway Documentation – plugins, routes, providers · 2026-05
  2. Kong Gateway 3.8 Release Notes – new AI plugins and improvements · 2026-04
  3. Kong AI Plugin reference – ai-prompt-guard, ai-rate-limiting-advanced, ai-semantic-caching · 2026-05
  4. Kong Gateway Operator – Kubernetes CRDs for AI routes · 2026-03

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call