Cloudflare AI Gateway: edge proxy for OpenAI, Anthropic, Workers AI

Cloudflare AI Gateway runs on the Cloudflare edge, is free in the Workers plan, and bundles OpenAI, Anthropic, Mistral, Replicate, and Workers AI behind one API.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is Cloudflare AI Gateway?

Cloudflare AI Gateway (cloudflare.com/ai-gateway) is a cloud-only proxy running on the Cloudflare edge. The product was announced in 2023 and has been in GA since 2024. As of May 2026, supported upstream providers are: OpenAI, Anthropic, Mistral AI, Replicate, Cohere, Perplexity, Google AI Studio, Groq, DeepSeek, Workers AI (Cloudflare-native inference), Azure OpenAI, and Amazon Bedrock. Every request goes through a provider-specific URL; for OpenAI chat completions it has the form https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/openai/v1/chat/completions. The request enters at the nearest Cloudflare PoP and is forwarded from there to the upstream.

The business model is tightly woven into the Cloudflare stack. On the Workers Free plan, the gateway is free up to 100,000 requests per day; above that, requests count against the Workers Paid quota (USD 5/month for 10M requests). There is no separate AI Gateway licence: anyone on a Workers plan has the AI Gateway. That makes onboarding very easy for setups already on Cloudflare.

The feature set focuses on four blocks. First, caching: a request can be marked as cacheable with the cf-aig-cache-ttl header and is then cached in Cloudflare KV or D1; a second identical request gets the response in under 10 ms. Second, rate limiting: requests per minute, hour, or day can be limited per gateway and per API token. Third, analytics: the dashboard shows requests, tokens, cost, cache hit rate, and errors per model, provider, and application. Fourth, logging: requests and responses are stored for up to 30 days (free) or exported via Logpush to your own object storage (R2, S3).
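To make the rate-limiting block concrete, here is a minimal sketch of a fixed-window limiter keyed by API token. It illustrates the general technique only; how Cloudflare implements its limits internally is not described here, and the class name FixedWindowLimiter and its parameters are hypothetical.

```python
import time
from collections import defaultdict
from typing import Optional


class FixedWindowLimiter:
    """Toy fixed-window limiter (illustrative, not Cloudflare's implementation)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        # (key, window index) -> number of requests admitted in that window
        self._counts = defaultdict(int)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        """Return True if the request for `key` fits into the current window."""
        now = time.time() if now is None else now
        bucket = (key, int(now // self.window_seconds))
        if self._counts[bucket] >= self.limit:
            return False
        self._counts[bucket] += 1
        return True
```

With limit=2 and window_seconds=60, the third request of the same token inside one minute is rejected, while a different token or the next minute starts from zero again.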

For fairlane.systems, Cloudflare AI Gateway is mainly relevant in two setups: first, for applications already running on Cloudflare Workers or Pages that should benefit from the edge cache; second, as a cost-tracking layer for setups where running a self-hosted gateway is not wanted. For strictly FADP-bound Swiss setups, the edge architecture is not optimal because routing is globally distributed.

Why it matters

Three properties explain the significance. First: zero-effort onboarding. Anyone with a Cloudflare account activates the AI Gateway in 30 seconds in the dashboard, copies the endpoint URL, and replaces the provider base URL in their application. No new server, no YAML configuration, no Docker Compose file. That cuts the entry barrier to a minimum – for prototypes and small pilot projects, this is the fastest path to cost tracking and caching.

Second: edge latency. Cloudflare runs PoPs in 300+ cities, including Zurich, Geneva, and Basel. From Switzerland the nearest Cloudflare PoP is typically reached in under 5 ms; the gateway routing itself adds 2-8 ms. Overall, Cloudflare has the lowest latency among the managed gateways. For voice bots and streaming applications already living in the Cloudflare world (Workers AI, Stream, Calls), that is a clear advantage.

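Third: cache hit effect. For recurring requests (e.g. FAQ answers or public research templates), the cache delivers responses in under 10 ms and without provider token costs. A FAQ chatbot with a 30% cache hit rate cuts LLM token cost by roughly 30% (assuming cached and uncached requests are of similar size) and mean latency by nearly as much. Tail latency such as p95 is still set by the uncached requests and stays largely unchanged until the hit rate climbs much higher. Either way, it is a high-impact optimisation with minimal effort.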
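The arithmetic behind the cache effect can be sketched as a simple expected-value calculation. The function below is illustrative: cache_effect and its parameter names are assumptions, the 10 ms default is the cache latency quoted in the text, and only the mean is modelled (percentiles such as p95 depend on the full latency distribution).

```python
def cache_effect(hit_rate, upstream_cost, upstream_latency_ms, cache_latency_ms=10.0):
    """Return (expected cost per request, expected mean latency in ms).

    Assumes cache hits incur no provider token cost and that hits and
    misses have the same average upstream cost and latency.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    expected_cost = miss_rate * upstream_cost
    expected_latency = hit_rate * cache_latency_ms + miss_rate * upstream_latency_ms
    return expected_cost, expected_latency
```

As an illustration with made-up inputs, a 30% hit rate on requests that cost USD 0.002 and take 800 ms upstream leaves 70% of the cost and a mean latency of about 563 ms.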

Under the revised Swiss FADP, Cloudflare AI Gateway needs a nuanced assessment. Cloudflare offers an EU-region toggle (data processing only in EU PoPs), but routing to the upstream LLM can still go through US servers depending on the model (e.g. OpenAI). The gateway itself stores logs in the EU or globally, as configured. For client data under professional secrecy, Cloudflare AI Gateway is therefore only suitable when the upstream is an EU model (Mistral La Plateforme, Azure OpenAI EU) and the EU toggle is active. For open research setups, the solution is practical.

How it works

In the Cloudflare dashboard, under AI > AI Gateway, you create a new gateway; you assign an ID (e.g. fairlane-prod) and decide on logging (on/off) and region (US/EU/Global). The gateway has a URL of the form https://gateway.ai.cloudflare.com/v1/{account_id}/fairlane-prod. Behind this URL are sub-paths per upstream provider: /openai/v1, /anthropic/v1, /mistral/v1, /workers-ai/.
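A small helper can assemble the per-provider base URL from these parts. This is a sketch: the function name gateway_base_url is hypothetical, and the sub-paths simply mirror the ones listed above, so they should be checked against the current Cloudflare provider documentation before use.

```python
GATEWAY_HOST = "https://gateway.ai.cloudflare.com/v1"

# Sub-paths as listed in the text above (assumption: unchanged upstream).
PROVIDER_PATHS = {
    "openai": "openai/v1",
    "anthropic": "anthropic/v1",
    "mistral": "mistral/v1",
    "workers-ai": "workers-ai",
}


def gateway_base_url(account_id: str, gateway_id: str, provider: str) -> str:
    """Build the base URL an SDK would use instead of the provider's own URL."""
    try:
        sub_path = PROVIDER_PATHS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    return f"{GATEWAY_HOST}/{account_id}/{gateway_id}/{sub_path}"
```

Keeping the mapping in one place means a provider switch changes a single argument rather than a hard-coded URL in each service.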

Application integration follows the OpenAI schema, only the base URL changes:

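import os

import openai

# Only base_url differs from a direct OpenAI integration.
# {account_id} is a placeholder for your Cloudflare account ID.
client = openai.OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://gateway.ai.cloudflare.com/v1/{account_id}/fairlane-prod/openai/v1",
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "..."}],
    extra_headers={
        "cf-aig-cache-ttl": "3600",  # cache the response for one hour
        "cf-aig-metadata": '{"user": "client-12"}',  # JSON-encoded metadata
    },
)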

The cf-aig-cache-ttl header marks the response as cacheable for an hour; the cf-aig-metadata header attaches arbitrary metadata filterable in the analytics dashboard by client/application/function. Caching is exact (not semantic) – the same prompt text plus the same parameters yield a cache hit.
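Building the two headers by hand invites quoting mistakes in the JSON metadata, so a small helper can help. The sketch below assumes only the two header names mentioned above; the function name gateway_headers and its arguments are hypothetical.

```python
import json
from typing import Dict, Optional


def gateway_headers(cache_ttl_seconds: Optional[int] = None,
                    metadata: Optional[dict] = None) -> Dict[str, str]:
    """Return the extra headers for one request through the gateway."""
    headers: Dict[str, str] = {}
    if cache_ttl_seconds is not None:
        if cache_ttl_seconds < 0:
            raise ValueError("cache_ttl_seconds must not be negative")
        headers["cf-aig-cache-ttl"] = str(int(cache_ttl_seconds))
    if metadata:
        # Serialise once, compactly and with stable key order.
        headers["cf-aig-metadata"] = json.dumps(
            metadata, separators=(",", ":"), sort_keys=True
        )
    return headers
```

The result can be passed as extra_headers to the OpenAI client; calling the helper without arguments yields an empty dict, which suits dynamic requests that should not be cached.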

Fallback routing has been available as an experimental feature since 2025-Q4: a gateway can have a fallback list of models (e.g. anthropic/claude-opus-4.7 -> openai/gpt-4o -> mistral/mistral-large-2411); if the primary model returns 5xx or times out, the gateway jumps to the next. This feature is younger than the equivalent in LiteLLM or Portkey and was less flexible as of May 2026.
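The fallback behaviour described above can be sketched as a plain loop over a model list. This is a client-side illustration of the idea, not the gateway's configuration format: UpstreamError, call_with_fallback, and the call_model callback are hypothetical names.

```python
from typing import Callable, Sequence, Tuple


class UpstreamError(Exception):
    """Stands in for a 5xx response or a timeout from one upstream model."""


def call_with_fallback(models: Sequence[str],
                       call_model: Callable[[str], str]) -> Tuple[str, str]:
    """Try each model in order; return (model used, response text)."""
    errors = []
    for model in models:
        try:
            return model, call_model(model)
        except UpstreamError as exc:
            # Remember why this model failed, then move on to the next one.
            errors.append(f"{model}: {exc}")
    raise UpstreamError("all models failed: " + "; ".join(errors))
```

Only failures signalled as UpstreamError trigger the next entry; other exceptions (for example a malformed request) propagate immediately, which mirrors the rule that only 5xx responses and timeouts cause a jump.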

Logs are kept in Cloudflare storage for up to 30 days as part of the plan; for longer retention, activate Logpush, which exports all request and response bodies to R2 (Cloudflare object storage) or an external S3 bucket. For Art. 957a CO audit trails, Logpush plus R2 with versioning and object lock is a valid configuration.

Cloudflare AI Gateway setup in 5 steps

01 Create a gateway in the Cloudflare dashboard under AI > AI Gateway: region EU, logging on, retention 30 days.
02 Check provider sub-paths (/openai/v1, /anthropic/v1, /mistral/v1) and copy the endpoint URL into the application.
03 Activate caching headers (cf-aig-cache-ttl) for FAQ and standard requests; leave dynamic requests uncached.
04 Set metadata headers (cf-aig-metadata) per client/application for analytics filters in the dashboard.
05 Activate Logpush to R2 for the audit trail and set Object Lock on the R2 bucket for WORM compliance.

When Cloudflare AI Gateway fits

First, for applications already running on Cloudflare. Anyone using Workers, Pages, R2, or D1 essentially gets the gateway for free – no extra infrastructure, no extra licence. Integration is deeper than with external gateways: Workers bindings allow calling directly from Worker code without an external HTTP request.

Second, for setups with a high cache share. FAQ chatbots, public research endpoints, tutorial answers, pre-configured templates – anything with repeatable prompts profits from the cache. 20-50% hit rate is achievable in practice and directly saves token cost.

Third, for multi-provider cost tracking without self-hosting. Anyone wanting cost visibility across OpenAI, Anthropic, and Mistral without running a LiteLLM server can use the Cloudflare gateway as a dashboard. The analytics UI is built in, and Logpush exports the data to R2 or S3.

Fourth, for voice bots and streaming applications close to the edge. Cloudflare Stream and Cloudflare Calls integrate deeply with AI Gateway; for a voice bot running on Workers AI and routing through gateway.ai.cloudflare.com, the gateway hop adds only single-digit milliseconds, because inference runs locally on the Cloudflare edge.

Fifth, for pilot and test phases with low volume. The free tier with 100,000 requests/day fully covers almost every pilot project. Only at scaling production load does the comparison with self-host alternatives pay off.

When not to use

First, with a hard self-host requirement. Cloudflare AI Gateway runs exclusively on Cloudflare infrastructure – there is no self-host mode, no on-premises deployment. Anyone who must keep all LLM requests on their own hardware (e.g. due to strict professional secrecy or public-sector requirements) is better served by LiteLLM, Kong, or Bifrost.

Second, with client data without an EU routing guarantee. By default, Cloudflare AI Gateway runs globally distributed. The EU region toggle restricts the Cloudflare edge, but upstream routing (OpenAI USA, Anthropic USA) remains global. Anyone allowed to send client data only to EU providers needs an explicit model whitelist – the Cloudflare gateway does not provide this depth.
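Where such an allowlist is required, it has to live in the application in front of the gateway. A minimal sketch, assuming a hypothetical set EU_ALLOWED_MODELS whose single entry is only an example and not a statement about where any model is actually hosted:

```python
# Hypothetical allowlist; fill it with the models your own assessment approves.
EU_ALLOWED_MODELS = frozenset({"mistral/mistral-large-2411"})


def require_allowed_model(model: str, allowed=EU_ALLOWED_MODELS) -> str:
    """Return the model name unchanged, or refuse before any request is sent."""
    if model not in allowed:
        raise PermissionError(f"model {model!r} is not on the EU allowlist")
    return model
```

Calling the guard right before the SDK call means a misconfigured model name fails locally instead of sending client data to a non-approved upstream.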

Third, when semantic cache instead of exact cache is wanted. Cloudflare caches only identical prompts. Anyone who wants similar requests with the same content but slightly different wording to count as hits (typical for FAQ chatbots) needs Portkey, Helicone, or a custom Redis+embedding solution.
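The difference can be illustrated with two rewordings of the same question. In the sketch below, the exact key is a plain hash of the prompt, and word-set overlap (Jaccard similarity) serves as a deliberately crude stand-in for the embedding similarity a real semantic cache would use; all function names are hypothetical.

```python
import hashlib
import re


def exact_key(prompt: str) -> str:
    """Hash of the literal prompt text: any change in wording changes the key."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def token_set(prompt: str) -> set:
    """Lower-cased alphanumeric words of the prompt."""
    return set(re.findall(r"[a-z0-9]+", prompt.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of words the two prompts have in common (0.0 to 1.0)."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

For "What are your opening hours?" and "What are the opening hours?", the exact keys differ, so an exact cache misses, while the word overlap is 4 of 6 words and a similarity-based cache with a suitable threshold could treat them as the same question.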

Fourth, with deeper compliance requirements such as prompt versioning, eval sets, and a hash-chained audit trail. Cloudflare AI Gateway is optimised for operations and cost, not for compliance workflows. Langfuse or Portkey cover this area considerably better.

Fifth, when the setup runs entirely outside Cloudflare. An on-premises pipeline on a Hetzner dedicated server with self-hosted Mistral and Postgres does not profit from the Cloudflare edge; latency would even be higher than with a local gateway.

Trade-offs

STRENGTHS

Zero-effort setup in the Cloudflare dashboard, free in the Workers plan up to 100k requests/day
Edge latency under 10 ms from Switzerland thanks to Cloudflare PoPs in Zurich/Geneva
Built-in cache, rate limit, analytics, and Logpush without self-host effort
Deep integration with Workers, Workers AI, R2, D1, and Stream

WEAKNESSES

No self-host and no on-premises deployment – cloud-only on Cloudflare infrastructure
Cache is exact (prompt hash), no semantic caching
Fallback routing is younger and less flexible than in LiteLLM or Portkey
No prompt repository with versioning and A-B tests

FAQ

What does Cloudflare AI Gateway cost?

Free up to 100,000 requests/day in the Workers Free plan. The Workers Paid plan (USD 5/month) includes 10M requests/month; further requests cost USD 0.30 per million. Logpush to R2 is billed separately (USD 0.05 per million requests, plus R2 storage). Provider token costs are billed unchanged by the respective LLM provider; Cloudflare takes no markup.
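The paid-plan arithmetic can be written down as a short function. The default figures are the ones quoted in this answer and may change; the function name monthly_gateway_cost_usd is hypothetical, and Logpush, R2 storage, and provider token costs are left out.

```python
def monthly_gateway_cost_usd(requests_per_month: int,
                             base_fee: float = 5.0,
                             included_requests: int = 10_000_000,
                             overage_per_million: float = 0.30) -> float:
    """Workers Paid base fee plus overage for requests beyond the included quota."""
    if requests_per_month < 0:
        raise ValueError("requests_per_month must not be negative")
    extra = max(0, requests_per_month - included_requests)
    return base_fee + extra / 1_000_000 * overage_per_million
```

At 25M requests per month, 15M requests are overage, which comes to roughly USD 9.50 in total; anything up to 10M stays at the USD 5 base fee.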

How does the cache hit workflow look technically?

Cloudflare hashes prompt text, model name, and parameters to a cache key. With cf-aig-cache-ttl header active, Cloudflare stores the response in KV or the Cache API; a follow-up request with the same hash gets the cached response back in under 10 ms without an upstream call. For dynamic requests (e.g. chat with individual inputs), hit rate is typically under 5%; for FAQ and standard templates, it reaches 20-50%.
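The workflow can be sketched with an in-memory stand-in for the cache store. This is an illustration of exact-match caching with a TTL, not Cloudflare's actual key derivation or storage: cache_key, TTLCache, and the choice of SHA-256 over canonical JSON are assumptions.

```python
import hashlib
import json
import time


def cache_key(model: str, messages: list, params: dict) -> str:
    """Derive a key from model, prompt messages, and parameters."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """Dictionary-backed cache whose entries expire after ttl_seconds."""

    def __init__(self):
        self._store = {}  # key -> (expiry timestamp, value)

    def put(self, key, value, ttl_seconds, now=None):
        now = time.time() if now is None else now
        self._store[key] = (now + ttl_seconds, value)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(key)
        if entry is None or entry[0] <= now:
            self._store.pop(key, None)  # drop expired entries lazily
            return None
        return entry[1]
```

Because the key covers the parameters as well as the prompt, the same question asked with a different temperature is a miss, which is one reason hit rates for dynamic chat traffic stay low.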

Does Cloudflare AI Gateway allow EU-only routing?

Cloudflare offers a region toggle that routes the gateway only via EU PoPs (Frankfurt, Amsterdam, Paris, Stockholm). Upstream routing – e.g. to OpenAI USA – remains global. A full EU guarantee additionally requires an upstream provider in the EU (Mistral La Plateforme, Azure OpenAI Frankfurt, Anthropic Claude on AWS Bedrock Frankfurt). Cloudflare handles region compliance for the gateway, not for the upstream.

How does Cloudflare AI Gateway integrate with Workers AI?

Workers AI is Cloudflare's own inference platform (Llama 3.3, Mistral 7B, Stable Diffusion, Whisper, etc.). The AI Gateway routes requests at /workers-ai/ directly to Cloudflare inference; token costs are charged to the Workers AI budget, while the gateway logs requests and caches responses. A voice bot with Whisper transcript plus Llama 3.3 response plus TTS in one Worker runs end-to-end under 200 ms latency.
