DuckDB
The columnar, in-process analytics database. Reads Parquet, CSV, and JSON like they were native tables; runs aggregations over hundreds of millions of rows on a laptop. SQLite for analytics — same single-file ethos, vectorized engine, MIT license.
DuckDB is what happens when you ask: "What if SQLite, but for analytics?" An in-process, single-file, columnar, vectorized SQL engine that reads Parquet and CSV like first-class tables, runs on a laptop, and beats specialized analytical databases on workloads that fit in memory. Official docs at duckdb.org/docs own the SQL reference.
What it is
A columnar, vectorized analytical SQL engine, designed to live inside an application — no server, no port, no daemon. Single binary, embeddable C++ library, language bindings for Python, R, Node, Rust, Go, Java. MIT-licensed. Hot-path columnar execution: SIMD where it helps, vectorized operators, predicate pushdown into Parquet readers, automatic parallelism.
Distribution: pip install duckdb, npm install duckdb, a CLI binary, a WASM build that runs in the browser. The motherduck.com offering puts a managed layer in front for collaboration; the engine itself remains the free open-source one.
The pitch: when the questions you ask are analytical ("sum revenue by month over five years"), file-shaped ("query this Parquet directory without loading it"), or single-process ("a Jupyter notebook that needs to crunch 200M rows"), DuckDB is the right primitive. The wrong primitive in Postgres is "scan a hundred million rows on a row-store with no compression." The wrong primitive in pandas is "load 50GB into RAM." DuckDB threads the needle.
When to use it
Reach for it when:
- The workload is OLAP / analytics — aggregations, GROUP BY, joins between modest tables. Columnar layout pays off.
- You have Parquet / CSV / JSON files on disk or S3 —
SELECT * FROM 'data/*.parquet'works out of the box. - The data fits on one box but pandas is choking — DuckDB will eat data that crashes pandas because it streams and spills.
- You want SQL in a notebook —
duckdb.sql("SELECT ...").df()is the Python ergonomic; same for R, Node, etc. - You need embedded analytics in an application — a desktop tool, a CLI, a sidecar service that needs to crunch numbers without a separate database server.
- You want to prototype a data warehouse before committing to Snowflake / BigQuery / ClickHouse. Same SQL dialect family; trivial graduation path.
Skip it when:
- The workload is OLTP — concurrent writers, small transactional updates. DuckDB serializes writes and is not designed for many tiny mutations. Use Postgres.
- You need a server with many concurrent users — DuckDB is embedded; one process per "session." Use a real warehouse for shared querying at scale.
- The dataset is petabyte-scale — that’s the warehouse’s job. DuckDB shines from MBs to hundreds of GBs.
- You need point lookups by key at high QPS — wrong layout. Postgres or a KV store.
At a glance
Core ideas
- Columnar storage — values from the same column laid out together. Wins on aggregations, wins on compression, wins on cache behavior for analytical workloads.
- Vectorized execution — operators process batches of values (typically 1024) at a time, not one row at a time. Closer to "compiled query" performance without the compile.
- Parquet as a native table — no import step. Predicate pushdown, projection pushdown, parallel reads. Reading 10 Parquet files in parallel is the default, not a configuration.
- Httpfs extension —
SELECT * FROM 's3://bucket/path/*.parquet'. Query data lake files directly with no copy. - Lossless integration with pandas / Arrow / Polars — zero-copy where possible. Use DuckDB for what it’s good at, hand back to the dataframe library for the rest.
Things DuckDB does that nothing else does as cleanly
- CSV/JSON auto-detection. Open a CSV; DuckDB sniffs the types. Usually right.
- Date arithmetic in SQL.
DATE_TRUNC('month', ts)and friends, with predictable semantics. - Window functions and complex SQL. Postgres-flavored SQL — most queries port directly.
- WASM build. Run DuckDB in a browser tab against Parquet on a CDN. Real applications already do this.
How to integrate
Default integration for a new analytical workload:
- Install in-process.
pip install duckdb/npm install duckdb. One line. - Decide: persistent or ephemeral.
duckdb.connect()is an in-memory database.duckdb.connect('mydata.duckdb')writes to a file. The file is one binary blob, easy to ship. - Point at the files you have.
SELECT * FROM 'logs/*.parquet'works without an import step. Materialize into a table (CREATE TABLE x AS SELECT ...) only if the same query repeats. - Use the dataframe boundary. Push the heavy aggregation into DuckDB; pull the result back as a pandas / Polars / Arrow dataframe.
- Parallelism is automatic. DuckDB will use all your cores. Tune
SET threads=Nonly when you have a reason. - For sharing, materialize to Parquet or upload to MotherDuck. Embedded means one writer; collaboration wants a different surface.
In the GL stack
DuckDB is not in production today but is the right tool for several near-term moves:
- Janus / Plutus analytics. Monthly cashflow, position rollups, opportunity scoring against historical Pendle / Aerodrome snapshots. The data lives in JSON files and Postgres exports today; DuckDB would let those become first-class queryable surfaces without ETL.
- Eval corpus analysis for compiled DSPy modules — joining trace files, computing per-prompt rollups. The shape is exactly what DuckDB exists to do.
- paiddaily.io research queries that don’t belong in production Postgres — ad hoc questions across Parquet exports of historical market state.
Status today: watching, not applied. Will move to applied the first time a notebook query needs to outgrow pandas without taking on a server.
Gotchas
- Single writer. Embedded means one process holds the write lock. Multiple readers via
read_only=Trueare fine; concurrent writers across processes are not the design point. - Memory pressure. DuckDB streams and spills, but a poorly-chosen plan against a huge dataset can still eat your RAM. Watch
EXPLAIN ANALYZE. - Filesystem layouts matter. A directory of small Parquet files is slower than a directory of right-sized ones (~50-500MB each). The data lake hygiene that helps Spark also helps DuckDB.
- Schema drift in CSV/JSON. Auto-detection is great until two files disagree on a column type. Specify the schema when ingestion is repeated.
- Not a replacement for a warehouse at scale. When the data outgrows one machine, the answer is a warehouse, not "bigger DuckDB."
Risks
- Young project, fast-moving. Major versions ship breaking changes; the on-disk format has been stable but has changed. Pin versions in production.
- Single-vendor stewardship. DuckDB Labs and MotherDuck are the primary funders. The OSS engine is MIT, which mitigates lock-in; but the velocity depends on the company staying viable.
- It is so good at the laptop scale that teams over-rely on it. When you genuinely need a multi-user warehouse, the graduation hurts more if you delayed it.
Alternatives
- ClickHouse — Wins when you need a real-time analytical server with many concurrent users. Loses on embedded ergonomics. Different operating model entirely.
- Polars — Wins when the whole pipeline is in a dataframe shape and SQL is not desired. Loses on raw-Parquet-as-table ergonomics. Often complementary; the two interop cleanly.
- Apache DataFusion — Wins when you want to embed a query engine into a Rust application. Loses on ecosystem maturity. Same columnar/vectorized lineage.
- BigQuery / Snowflake — Wins at warehouse scale with many users and TB+ data. Loses on cost and on "I just want to query a Parquet file on my laptop." Use DuckDB to prototype, the warehouse to scale.
- pandas — Wins for small dataframes and one-off transforms. Loses on anything that doesn’t fit in memory. DuckDB + pandas together is better than either alone.
