DUCKDB · TECH

DuckDB: embedded columnar OLAP database for local data analysis

DuckDB 1.x is stable in May 2026. MIT license, embedded, columnar, perfect for local data analysis -- Pandas replacement and fiduciary reporting tool.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is DuckDB?

DuckDB is an embedded columnar OLAP database, developed since 2018 by Mark Raasveldt and Hannes Muehleisen at CWI (Centrum Wiskunde & Informatica) in Amsterdam. Inspired by SQLite (single-file, no server) but columnar instead of row-oriented -- the comparison "SQLite for analytics" hits the core. In May 2026, version 1.2 is stable with improved Apache Arrow integration, vector-search extension, and Parquet compression optimizations.

License: MIT, fully open source, no license topics. DuckDB Labs (founded 2021) offers commercial support and MotherDuck (managed DuckDB Cloud) as hosting option, but the core stays MIT OSS without open-core model.

DuckDB runs as a library in Python, Node, R, Rust, Java, Go, C++. A DuckDB database is a file (.duckdb extension) or in-memory. Unlike ClickHouse, DuckDB needs no server -- it is embedded in the application, like SQLite. Unlike SQLite, it is columnar and optimized for analytical queries -- 50-100x faster than SQLite on aggregates.

DuckDB killer feature: reading external data formats. A DuckDB query can work directly on CSV, Parquet, JSON, Arrow, Iceberg, Delta Lake files -- locally on disk or via HTTP/S3 paths. SELECT * FROM 'https://example.com/data.parquet' works without prior import. This makes DuckDB the perfect analysis layer over existing data lakes or S3-based data.

In our stack we use DuckDB for ad-hoc analytics over fiduciary client snapshots (Postgres export as Parquet -> DuckDB query), for log analysis (Loki export as JSON -> DuckDB), and as Pandas replacement in Python scripts (100x faster on larger data). In May 2026, DuckDB is the de-facto standard library for local data analysis.

Why it matters

DuckDB has in May 2026 filled a gap that opened between SQLite (too small/too slow for analytics) and ClickHouse (too big/server-based). Three reasons why DuckDB is relevant in the Swiss SME fiduciary context.

First: local data analysis without server effort. A fiduciary analyst wants to analyse a client's booking data snapshot (5 GB CSV or Parquet) -- common booking accounts, monthly trends, anomaly detection. With Pandas and 5 GB, you fight OOM errors on a 16 GB laptop. With DuckDB in Python: SELECT statements over the same 5 GB in seconds, without out-of-memory.

Second: direct access to cloud data without ETL. SELECT * FROM 's3://bucket/fiduciary-export.parquet' WHERE account = 1000 -- DuckDB reads directly from S3, without prior import. This makes a fiduciary data warehouse possible without data movement: data lives in Hetzner Storage Box as Parquet, queries run locally from DuckDB.

Third: Pandas replacement in Python pipelines. Pandas is an industry standard, but 1) single-threaded, 2) RAM-hungry, 3) impractical above ~1 GB. DuckDB is multi-threaded, RAM-efficient (can spill to disk), and 5-50x faster on standard analysis workloads. A Pandas DataFrame is interchangeable 1:1 with a DuckDB table via Arrow format.

For which cases is DuckDB NOT meant? As production DB for web applications (that is Postgres' job). As multi-user server (DuckDB is embedded, one file lock per writer). As cluster for huge data above 100 GB (that is ClickHouse territory). DuckDB is an analysis tool, not an operational store.

In May 2026, DuckDB is mandatory in many data-science stacks: Apache Airflow, dbt, Dagster, Streamlit apps have native DuckDB integration. MotherDuck (DuckDB Cloud) is available as option for shared DuckDB databases -- relevant for teams sharing DuckDB workflows across analysts.

How it works

DuckDB is a C++ library with bindings for all common languages. The simplest use case in Python:

import duckdb

# In-memory query on a CSV result = duckdb.sql(""" SELECT account, count() AS n, sum(amount) AS total FROM '/data/bookings.csv' WHERE year = 2025 GROUP BY account ORDER BY total DESC LIMIT 20 """) result.show()

# Pandas integration import pandas as pd df = pd.read_csv('/data/clients.csv') result = duckdb.sql("SELECT * FROM df WHERE industry = 'Fiduciary'").fetchdf()

The query engine is vectorized execution -- operations are performed on batches of thousands of rows simultaneously, which optimally uses modern CPU SIMD instructions. Compression is columnar-typically efficient (10-30x on realistic data).

DuckDB supports persistent files (.duckdb) with ACID transactions, or in-memory. A .duckdb file can contain multiple tables, views, and schemas. Schema evolution is possible via ALTER TABLE, but DuckDB is primarily optimized for read-heavy workloads -- frequent updates and deletes on large tables are not the sweet spot.

External data access:

# Direct Parquet read from S3 (with httpfs extension) INSTALL httpfs; LOAD httpfs; SELECT * FROM 's3://bucket/data.parquet' LIMIT 100;

# JSON read with schema inference SELECT * FROM read_json_auto('log.jsonl');

# Read Iceberg tables INSTALL iceberg; LOAD iceberg; SELECT * FROM iceberg_scan('s3://bucket/iceberg-table/');

Extensions in May 2026: httpfs (S3/HTTP), parquet, json, vss (vector similarity search), spatial (geo), iceberg (Apache Iceberg), delta (Delta Lake), postgres_scanner (live read from Postgres), mysql_scanner.

Multi-threading is on by default, DuckDB uses all available CPU cores. On an 8-core laptop, aggregate queries over 10 GB Parquet typically run in 1-5 seconds.

MotherDuck as cloud variant: DuckDB databases are hosted in the cloud, local clients can send queries to the cloud and process results locally. Hybrid mode: part of the data in the cloud, part locally. EU region available.

DuckDB to production in 5 steps

01Validate use case: is it analysis (read-heavy, aggregates) or OLTP (write-heavy)? For OLTP -> SQLite or Postgres. For analysis -> DuckDB.
02Install DuckDB as library: pip install duckdb (Python), npm install @duckdb/duckdb (Node), CRAN package (R). Embedded mode suffices for 90 percent of cases.
03Data pipeline: Postgres/MySQL/CSV/Parquet -> DuckDB table. For recurring analyses persist a snapshot as .duckdb file.
04Develop queries: standard SQL, vectorized execution makes aggregates very fast. WITH CTEs and window functions fully supported. Visualization via Streamlit, Dash, or Jupyter.
05Sharing strategy: for multi-analyst teams check MotherDuck (EU region available), or DuckDB files in S3/Storage Box with read-only sync.

When to use DuckDB

The right choice when (a) data analysis should run locally without server effort, (b) Pandas hits its limits (RAM, single-threading), or (c) data lives in S3 Parquet, CSV, or JSON and should be queried directly.

Concrete scenarios: fiduciary reporting over client booking snapshots (Postgres export as Parquet, DuckDB query locally), log analysis from Loki export, web analytics snapshots for ad-hoc reports, KPI dashboards in Streamlit with DuckDB as engine, data quality checks over CSV imports, Pandas replacement in Python ETL pipelines.

New 2026 use cases: dbt with DuckDB as compute engine (instead of Snowflake/BigQuery for small teams), Dagster/Airflow tasks with DuckDB transformations, AI data preparation (embedding datasets, RAG source files in DuckDB table plus VSS extension for search).

MotherDuck (DuckDB Cloud) is attractive for teams sharing DuckDB workflows: shared data layer in the cloud, local query performance, EU region. Self-host stays the standard for pure local analysis -- no cloud needed.

For Excel power users: DuckDB is the natural next-step tech. Anyone processing 1 million rows in Excel and hitting performance limits can continue the same workflow in DuckDB with 100 million rows -- SQL-like language, comparable mental model.

When not to use

Never as production DB for web applications with multiple users. DuckDB is embedded and single-writer -- one file lock per write process. Anyone building multi-user applications belongs in Postgres or ClickHouse.

For high-frequency OLTP workloads (frequent single INSERT-UPDATE-DELETE), DuckDB is not optimized. Writes are slower than in SQLite, because the columnar format requires reorganization on every write. DuckDB is built for read-heavy analytics.

For really large data (> 100 GB per database) DuckDB becomes unwieldy. ClickHouse with cluster setup is right here. DuckDB is optimal for single-node analysis of 1-100 GB data -- beyond that, a server-based system pays off.

For real-time data pipelines with continuous updates, DuckDB is not the right tool. DuckDB is optimized for batch loads -- data comes once, is queried many times. Realtime streaming belongs in ClickHouse or Kafka + ClickHouse.

For multi-user sharing without MotherDuck: local DuckDB files are not network-share safe. If multiple analysts need to access a .duckdb file concurrently, you need MotherDuck or another sharing layer.

Trade-offs

STRENGTHS

MIT license, fully open source, no open-core model
Embedded without server, one .duckdb file or in-memory
50-100x faster than Pandas/SQLite on aggregate workloads
Direct read on CSV/Parquet/JSON/Iceberg without prior import
Multi-threading and vectorized execution out of the box

WEAKNESSES

Not suited for multi-user production apps (embedded, single-writer)
Not for high-frequency OLTP (frequent single writes slow)
No native replication or cluster mode (MotherDuck as workaround)
Does not scale beyond 100 GB per database optimally
Sharing between analysts needs MotherDuck or external sync solution

FAQ

DuckDB or Pandas in Python in 2026?

At < 100 MB data: Pandas, still ergonomic and fast enough. From 1 GB data: DuckDB, significantly faster and not RAM-limited (can spill to disk). DuckDB-Pandas interop via Arrow is direct -- you can combine both.

DuckDB or ClickHouse for an SME?

For local analysis for 1-5 analysts without multi-user concurrent access: DuckDB. Embedded, MIT license, zero server effort. For multi-user analytics server for 10+ users or streaming data: ClickHouse. DuckDB often complements ClickHouse -- ClickHouse as server store, DuckDB snapshots for ad-hoc analysis.

Can DuckDB read directly from Hetzner Storage Box?

Yes, with the httpfs extension. Storage Box has S3-compatible endpoints (via rclone or direct S3 gateway). SELECT * FROM 's3://bucket/data.parquet' works with correctly set S3 credentials. This turns Storage Box into a data lake for DuckDB analyses.

What does MotherDuck add compared to local DuckDB?

Cloud hosting for shared DuckDB databases, EU region available. Hybrid mode lets local queries access cloud data with caching. Sensible for 5+ analysts needing shared data. For single analyst or small team, local DuckDB without cloud suffices.

Sources

DuckDB 1.2 release notes · 2026-05
DuckDB documentation - extensions and SQL · 2026-05
DuckDB MIT License · 2026-05
MotherDuck Cloud pricing and EU region · 2026-05

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call