MANAGED · SERVICE
Managed Service & Monitoring: we keep it running, you use it
Monitoring, updates, security patches, incident response. Three tiers: Basic CHF 600/mo, Pro CHF 1,200/mo, Plus CHF 2,200/mo. Defined response times.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is managed service?
Managed service means someone takes responsibility for the running operation of your AI infrastructure. Concretely: monitoring (what is running, what is not), updates (security patches, version updates of the tools), backups (daily, verified), incident response (someone steps in when something breaks), reporting (what happened this month, what did LLM calls cost).
We have run our own infrastructure in production since early 2024 – 25 containers, 24 LLMs, 21 n8n workflows, without downtime. The stack we set up for you is the same one we run ourselves. That is not trivial: AI stacks have more moving parts than classic web applications. Vector database, embedding models, model providers with unstable APIs, n8n workflows with external trigger dependencies – there are many places where something can break quietly.
As a service from us: three tiers with clear response times and defined scope. Basic CHF 600 per month for small setups (1–2 servers, 1 use case). Pro CHF 1,200 per month for typical fiduciary setups (3–5 servers, multiple workflows, gateway). Plus CHF 2,200 per month for larger environments with 24/7 response. All tiers can be terminated monthly – no minimum term.
Why it matters
An AI stack that is not maintained decays faster than classic software. Three reasons.
First: security. n8n, Postgres, Docker, Nginx, the Linux OS – all receive security patches weekly to monthly. Without patching, you have an exploitable system within 6 months. CrowdSec, Fail2ban, and firewall rules are not configured once and forgotten; they have to be adjusted as new threat patterns appear.
Second: provider instability. LLM providers change APIs, deprecate models, raise prices, block regions. OpenAI has changed model names three times in the past 18 months. Anthropic deprecated Claude 3 Sonnet in early 2025. Anyone who does not actively track this suddenly has workflows that silently stop returning answers.
Third: quiet defects. n8n workflows often break silently rather than loudly. An inbox account rotates its OAuth token, and the polling workflow appears to keep running but emits no more triggers. An embedding pipeline indexes nothing from day X because a source folder was renamed. Without active monitoring such defects are noticed weeks later – when the bookkeeper notices that no new invoices are arriving.
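One way to catch such quiet defects is a staleness check: instead of asking "is the process running?", ask "when did it last produce real output?". The following is a minimal sketch of that idea, not our production rule set. The pipeline names, the silence limits, and the `find_stale` helper are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical expectations: how long each pipeline may stay silent before
# it is treated as broken. Names and limits are illustrative only.
MAX_SILENCE = {
    "inbox-polling": timedelta(hours=6),
    "embedding-index": timedelta(days=1),
}


def find_stale(last_output: dict[str, datetime], now: datetime) -> list[str]:
    """Return the pipelines whose last real output is older than allowed.

    A pipeline that has never reported any output is also treated as stale.
    """
    stale = []
    for name, limit in MAX_SILENCE.items():
        seen = last_output.get(name)
        if seen is None or now - seen > limit:
            stale.append(name)
    return sorted(stale)
```

The point of the design is that a workflow which "appears to keep running" still shows up here, because the check looks at output timestamps rather than process state.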
The Pro tier addresses this with: Glitchtip for application errors, Grafana for metrics (CPU, disk, latencies), Prometheus for collection, Loki for logs, Uptime-Kuma for end-to-end health checks, CrowdSec for anomaly detection, Telegram for alerts. Every one of these is open source – no vendor lock-in. We run the same stack for ourselves.
How we run it
The service has five pillars: onboarding, monitoring stack, patch routine, incident response, reporting.
Onboarding (one-off, included in the first month): we come into your environment, document the current state, check backups, harden servers (firewall, SSH keys, Fail2ban), set up the monitoring stack. Duration: 2–5 days depending on complexity.
Monitoring stack: Grafana as dashboard frontend, Prometheus for metrics (node-exporter per server, cAdvisor for containers, Postgres-exporter), Loki for structured logs (n8n, LiteLLM, bot services), Glitchtip for application errors (Sentry-compatible), Uptime-Kuma for external health checks (every service probed every 30 seconds). CrowdSec on every server for anomaly detection (bruteforce, scans, unusual login patterns). Alerts go to Telegram into a channel you can follow.
Patch routine: OS security patches are checked weekly, and patches for critical CVEs are applied within 24 hours. Tool updates (n8n, Postgres, Docker images) are applied quarterly, after testing in a staging environment. Database migrations are written to be idempotent and reversible.
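To illustrate what "idempotent and reversible" means for a migration, here is a small sketch. It uses Python's built-in `sqlite3` as a stand-in for Postgres so that it runs anywhere; the table name `llm_cost` and the function names are hypothetical. Running the "up" step twice leaves the same schema as running it once, and the "down" step undoes it.

```python
import sqlite3


def migrate_up(conn: sqlite3.Connection) -> None:
    """Apply the migration; safe to run more than once."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS llm_cost ("
        "day TEXT NOT NULL, model TEXT NOT NULL, chf REAL NOT NULL)"
    )
    conn.commit()


def migrate_down(conn: sqlite3.Connection) -> None:
    """Revert the migration; also safe to run more than once."""
    conn.execute("DROP TABLE IF EXISTS llm_cost")
    conn.commit()


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    """Check the SQLite catalogue for a table with the given name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None
```

Note that this sketch only reverses the schema. Dropping a table also drops its rows, so a real "down" step for a populated table would need a verified backup first.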
Incident response: Basic tier responds within 24 hours during business hours (Mon–Fri 08–18 Swiss time). Pro tier responds within 4 hours during business hours and 24 hours outside. Plus tier responds 24/7 within 4 hours – on-call rotation. Response means: acknowledge on Telegram, analysis in Loki, fix or mitigation.
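The response targets above can be written down as a small lookup. This is only a sketch under stated assumptions: the timestamp is taken to be Swiss local time already, public holidays are ignored, and the function names are hypothetical. For Basic outside business hours the text defines no target, so the sketch returns `None` rather than guessing one.

```python
from datetime import datetime
from typing import Optional

# Target response times in hours, taken from the tier description above:
# (inside business hours, outside business hours). None means the text
# defines no target for that case.
RESPONSE_HOURS = {
    "basic": (24, None),
    "pro": (4, 24),
    "plus": (4, 4),
}


def in_business_hours(local: datetime) -> bool:
    """Mon-Fri 08:00-18:00; `local` is assumed to be Swiss local time."""
    return local.weekday() < 5 and 8 <= local.hour < 18


def response_target_hours(tier: str, local: datetime) -> Optional[int]:
    """Return the response target in hours for an alert raised at `local`."""
    inside, outside = RESPONSE_HOURS[tier.lower()]
    return inside if in_business_hours(local) else outside
```

For example, `response_target_hours("pro", datetime(2026, 5, 5, 10, 0))` looks up a Tuesday morning alert on the Pro tier and yields the 4-hour target.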
Reporting: a Markdown report, quarterly (Basic) or monthly (Pro/Plus), covers uptime per service, incidents with root cause, a breakdown of LLM costs, patches applied, open items, and recommendations. Quarterly there is a 60-minute review with your management.
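As a rough idea of how two of the report sections could be produced, the sketch below computes uptime from health-check counts and renders a short Markdown report. The function names, the input shapes, and the report layout are assumptions made for this example, not our actual report template.

```python
def uptime_percent(probes_ok: int, probes_total: int) -> float:
    """Share of successful health checks, in percent."""
    if probes_total == 0:
        return 0.0
    return 100.0 * probes_ok / probes_total


def render_report(
    month: str,
    uptime: dict[str, tuple[int, int]],
    llm_costs_chf: dict[str, float],
) -> str:
    """Render uptime and LLM cost sections as a Markdown string.

    `uptime` maps a service name to (successful probes, total probes);
    `llm_costs_chf` maps a model name to its cost in CHF.
    """
    lines = [f"# Service report {month}", "", "## Uptime per service", ""]
    for service, (ok, total) in sorted(uptime.items()):
        lines.append(f"- {service}: {uptime_percent(ok, total):.2f} %")
    lines += ["", "## LLM costs", ""]
    for model, chf in sorted(llm_costs_chf.items()):
        lines.append(f"- {model}: CHF {chf:.2f}")
    lines.append(f"- Total: CHF {sum(llm_costs_chf.values()):.2f}")
    return "\n".join(lines)
```

With a probe every 30 seconds, a 30-day month yields 86,400 probes per service, which is the kind of total that would go into the second element of each `uptime` tuple.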
We use no proprietary closed-source tools. If you terminate, we hand over Grafana dashboards, Prometheus configurations, Loki rules, and runbooks – all as Markdown and YAML in your Git repo.
From contract to routine
- 01 Inventory day: we enter your environment, document stack, credentials, backups, risks. Output: a PDF you can use independently.
- 02 Onboarding (2–5 days): set up monitoring stack (Grafana, Prometheus, Loki, Uptime-Kuma, CrowdSec, Glitchtip). Hardening (firewall, SSH, Fail2ban). Configure Telegram channel for alerts.
- 03 Build runbook: per service one Markdown document with purpose, dependencies, common failures, on-call response. Into Git repo.
- 04 Routine starts: weekly patch checks, monthly security patches, quarterly tool updates in staging before production.
- 05 Incident response: Telegram alert → acknowledge → Loki analysis → fix or mitigation → post-mortem in Git if larger.
- 06 Reporting: monthly (Pro/Plus) or quarterly (Basic) Markdown report with uptime, incidents, LLM costs, recommendations.
- 07 Quarterly review: 60 minutes with management. What works, what does not, what comes next quarter.
When to use
Managed service is the right choice when (a) the AI stack is business-critical (workflows without which something gets stuck), (b) you have no DevOps team of your own that uses these tools daily, (c) you need defined response times.
Concrete scenarios where we recommend managed service: a fiduciary practice with n8n workflows for invoice triage and mail routing – when that hangs, clients suffer. An SME with a RAG knowledge base and WhatsApp bot – the bot must answer, otherwise the customer feels abandoned. A law firm with a multi-LLM gateway – lawyers expect requests to work without having to know whether OpenAI is having issues.
The three tiers in detail:
Basic CHF 600/month: Monitoring stack running, security patches monthly, one quarterly report, 24h response in business hours. For setups with one server and one use case. Typical: a practice bot, a single n8n workflow stack.
Pro CHF 1,200/month: Like Basic plus monthly reporting, 4h response in business hours and 24h outside, quarterly review meeting, tool updates applied. For typical fiduciary setups with 3–5 servers and multiple workflows.
Plus CHF 2,200/month: Like Pro plus 24/7 response within 4 hours, monthly review meeting, performance tuning, dedicated point of contact. For larger environments or business-critical AI workflows with external impact.
All tiers include advance consultation on larger changes. If you want to add a new model, build a new workflow, or plan a migration, we review it upfront and say yes, no, or "better this way".
When not to use
Managed service is the wrong choice when you want to run everything yourself, or already can. Every project ends with a fully documented handover, so anyone with a DevOps team working daily with Grafana, Prometheus, and Loki can continue running the stack without us. We actively recommend that for clients who have the competence in-house.
The service is also wrong for setups that are too small. Whoever runs a single n8n workflow copying a file once a day does not need a CHF 600/month managed service. A Telegram alert on the cron run is enough. We actively decline such mandates instead of selling them – the complexity does not justify the cost.
Be careful with highly customised stacks we would not run ourselves. If you build on an unusual vector database or an internal proprietary tool we do not know, we cannot credibly promise a 4h response. In such cases we first plan consulting days until we understand the stack, or refer you to a better-suited provider.
Managed service is not suitable when the organisational prerequisites are missing. We need a contact person on your side reachable within the response time (for actions requiring approval – database migration, major updates). Without that person the service cannot work.
Trade-offs
STRENGTHS
- Defined response times – 24h, 4h, or 4h-24/7 per tier
- Monitoring stack fully open source – no vendor lock-in
- We run the same stack for ourselves – experience, not theory
- Monthly termination, full handover on exit
WEAKNESSES
- Fixed monthly cost – not economical for very small setups
- We respond to incidents but do not replace internal IT knowledge long-term
- New features are project work, not in the service flat fee
- Service depends on a reachable contact person on your side – a requirement, not an option
FAQ
What happens when an LLM provider fails?
The LiteLLM gateway has fallback rules configured per model class. If OpenAI fails for GPT-4o, the gateway routes to Claude-Sonnet. If Anthropic fails, it routes to Mistral Large. We receive the alert from Glitchtip, verify that the fallback engages cleanly, and inform you. For longer outages with quality impact, we message management on Telegram with a recommendation.
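The routing idea behind such a fallback chain can be sketched in a few lines of plain Python. This is not LiteLLM's API or our gateway configuration; `ProviderError`, `complete_with_fallback`, and the provider callables are hypothetical names used only to show the principle of trying providers in order and collecting the failures.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a provider call when the upstream API is unavailable."""


def complete_with_fallback(
    prompt: str,
    chain: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (provider name, answer).

    Raises ProviderError listing every failure if no provider answers.
    """
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Remember why this provider failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the answer matters for operations: it lets the caller log which fallback actually served the request, which is what we would check when verifying that a fallback engaged cleanly.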
Can we see the monitoring stack ourselves?
Yes. Grafana runs on your own infrastructure, and we grant your management and IT contact access (read-only or edit, as preferred). Uptime-Kuma has a public status page you can embed on your domain (status.company.ch). Telegram alerts flow into a shared channel – you can follow without being required to act.
What is not included in managed service?
Three things. First: new features or workflows – that is project work billed on a time-and-materials basis, not service. Second: major architecture changes (switch of vector database, new cloud provider) – project work. Third: end-customer support for your clients – we maintain infrastructure, not end users. We can train your support staff, though.
Can I terminate the service at any time?
Yes. Monthly notice, no minimum term. On termination we hand over everything within 14 days: Grafana dashboards, Prometheus configs, Loki rules, runbooks, credentials index. You then have a documented system your team or another provider can continue.