Skip to content

eracle/django-duckdb-readonly

Repository files navigation

django-duckdb-readonly

A read-only Django database backend for DuckDB. Query Parquet files on S3/GCS, local .duckdb files, or an in-memory DuckDB instance through the Django ORM — .filter(), .get(), .order_by(), .values(), aggregations, the whole familiar surface — without standing up a Postgres serving layer.

Writes are blocked at the cursor level: INSERT, UPDATE, DELETE, CREATE, ALTER, DROP all raise NotSupportedError. The point of this backend is to treat a DuckDB store as a read-only analytical replica that the Django ORM can query directly.

Why

If your "warehouse" is already Parquet on object storage, you probably don't need a separate transactional database just to serve it. Pipelines write Parquet, DuckDB reads it (with predicate pushdown and column pruning), the ORM treats DuckDB like any other database. You keep Postgres for the parts that actually mutate — users, sessions, billing — and let the analytical read path skip a sync step entirely.

This is the Django side of the DuckDB + Postgres hybrid pattern that projects like GlitchTip and EthicalAds use. The Django ecosystem didn't have a maintained backend for it; this is one.

Install

pip install django-duckdb-readonly

Requires Python 3.10+, Django 4.2+, DuckDB 1.0+.

Quickstart

# settings.py
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "app",
        # ...your real DB for users, sessions, billing
    },
    "warehouse": {
        "ENGINE": "django_duckdb_readonly",
        "NAME": "/var/data/analytics.duckdb",   # or ":memory:"
        "OPTIONS": {
            "duckdb_read_only": True,
            "extensions": ["httpfs"],
            "init_sql": [
                # Expose remote Parquet as a SQL view. Django models then
                # map to "listings" via Meta.db_table.
                "CREATE VIEW listings AS "
                "SELECT * FROM read_parquet('gs://my-bucket/marts/listings/*.parquet')",
            ],
        },
    },
}

DATABASE_ROUTERS = ["myapp.routers.WarehouseRouter"]
# myapp/models.py
from django.db import models

class Listing(models.Model):
    id = models.IntegerField(primary_key=True)
    name = models.CharField(max_length=120)
    state = models.CharField(max_length=2)
    score = models.FloatField()

    class Meta:
        managed = False           # never migrate this table
        db_table = "listings"     # matches the view name in init_sql
        app_label = "warehouse_app"
# myapp/views.py
from django.shortcuts import render
from .models import Listing

def top_listings(request, state):
    qs = (
        Listing.objects
        .filter(state__iexact=state, score__gte=0.5)
        .order_by("-score")[:50]
    )
    return render(request, "listings/top.html", {"listings": qs})

That's it. The ORM emits SELECT … WHERE UPPER(state) = UPPER(?) AND score >= ? ORDER BY score DESC LIMIT 50, DuckDB pushes the predicates into the Parquet scan, and you skip the entire mart → Postgres sync.

Configuration

All options live under DATABASES[<alias>]['OPTIONS']:

Option Type Default Purpose
duckdb_read_only bool False Open the DuckDB file in read-only mode. Use this for .duckdb files that should never be mutated. Note: the Django ORM layer is always read-only — this flag controls the lower-level DuckDB file lock.
duckdb_config dict[str, Any] {} Passed to duckdb.connect(config=…). Useful for s3_region, memory_limit, threads, etc.
extensions list[str] [] Names of DuckDB extensions to INSTALL and LOAD on connect (e.g. ["httpfs", "spatial", "h3"]).
init_sql list[str] [] Raw SQL run once on each new connection, bypassing the read-only check. This is the hook for CREATE VIEW … AS SELECT * FROM read_parquet(…), attaching other DuckDB databases, or seeding fixture data for tests.

The standard NAME field accepts:

  • ":memory:" — ephemeral in-process DuckDB (great for tests + init_sql)
  • A filesystem path — opens or creates a .duckdb file
  • Any other path DuckDB's connect() accepts

Reading Parquet on S3 / GCS

DuckDB's httpfs extension reads remote Parquet directly. Two common shapes:

GCS via service-account JSON (matching gcloud auth application-default login):

"OPTIONS": {
    "extensions": ["httpfs"],
    "duckdb_config": {
        # DuckDB picks up GOOGLE_APPLICATION_CREDENTIALS automatically;
        # or set the credential here:
        # "gcs_credentials": "/path/to/service-account.json",
    },
    "init_sql": [
        "CREATE VIEW listings AS "
        "SELECT * FROM read_parquet('gs://my-bucket/marts/listings/*.parquet')",
    ],
}

S3 with explicit credentials:

"OPTIONS": {
    "extensions": ["httpfs"],
    "duckdb_config": {
        "s3_region": "us-east-1",
        "s3_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
        "s3_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
    },
    "init_sql": [
        "CREATE VIEW events AS "
        "SELECT * FROM read_parquet('s3://my-bucket/events/*/*.parquet')",
    ],
}

See examples/parquet_on_gcs.py for a fully runnable example against a tiny local Parquet file (no cloud setup required).

Routing models to the warehouse

Use a standard Django database router. The typical pattern is "one app per analytical schema":

class WarehouseRouter:
    warehouse_app = "warehouse_app"

    def db_for_read(self, model, **hints):
        if model._meta.app_label == self.warehouse_app:
            return "warehouse"
        return None

    db_for_write = db_for_read

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if app_label == self.warehouse_app or db == "warehouse":
            return False
        return None

Every model in warehouse_app should set Meta.managed = False — the backend will hard-error on any DDL.

What's supported

Feature Status
.all(), .filter(), .exclude(), .get()
Field lookups (exact, iexact, in, gt, gte, lt, lte, isnull, contains, icontains, startswith, endswith, regex, iregex)
.order_by(), slicing, .distinct()
.values(), .values_list(), .only(), .defer()
Aggregations (Count, Sum, Avg, Min, Max)
Q() and F() expressions
Meta.managed = False models ✅ (recommended)
manage.py inspectdb <alias> against the DuckDB store
Multi-database routers
Writes through the ORM 🚫 hard-errors (by design)
migrate against the warehouse 🚫 hard-errors (by design)
select_related() / prefetch_related() across mart tables ⚠️ joins work where the underlying SQL is valid DuckDB, but cross-table FKs aren't introspected
DuckDB-native types (UUID, JSON, INTERVAL, STRUCT, LIST, MAP) ⚠️ scalar types work; nested types serialize as Python dict/list and you handle them in app code

Threading model

DuckDBPyConnection is not thread-safe by itself, but connection.cursor() returns an independent cursor per call — that's the unit of parallelism DuckDB documents ("Using Connections in Parallel Python Programs"). This backend leans on that: every DatabaseWrapper instance (one per thread, per Django's standard model) hands back the same physical duckdb connection for a given alias, and create_cursor() issues a fresh DuckDB cursor on every call.

The practical reason this matters: connection cold-start (INSTALL httpfs

  • remote-bucket auth + CREATE VIEW … read_parquet(…)) can be multi-second. A per-thread connection model under gunicorn --workers 1 --threads N would pay that cost N times. Sharing keeps it to once per process per alias.

A few consequences:

  • init_sql and connection_hook run exactly once per process per alias, not once per thread. Don't put per-request state in them.
  • DatabaseWrapper.close() is a no-op on the underlying connection — other threads may be mid-query. Use the classmethod DatabaseWrapper.close_shared(alias) to actually close the shared connection (test tearDown, credential rotation, etc.).
  • is_usable() → False does not auto-recover. If the shared connection goes bad, call close_shared(alias) and let the next request re-open.

Read-only enforcement

Two layers, intentionally independent:

  1. ORM layer (always on): every statement emitted by Django's compilers is checked against an allow-list of read keywords (SELECT, WITH, EXPLAIN, PRAGMA, etc.). Anything else raises NotSupportedError before it reaches DuckDB. This catches accidental .save(), .create(), .update(), .delete(), and any raw DDL.
  2. DuckDB connection layer (opt-in via duckdb_read_only=True): tells DuckDB itself to refuse writes at the file-lock level. Belt-and-braces; use it on production where the warehouse file must never be mutated.

init_sql runs through the raw connection, not the ORM cursor, so CREATE VIEW works there — that's the configured bootstrap path.

Roadmap

  • 0.1.x — bug fixes, more lookups, broader Django/Python version matrix
  • 0.2 — async cursor support (async def views), Django 6 compatibility polish
  • 0.3 — optional motherduck:// support (MotherDuck cloud-hosted DuckDB)

This is alpha software, but it's used in production for the read path of Sunnyplans — bug reports and PRs welcome.

Development

git clone https://github.com/eracle/django-duckdb-readonly.git
cd django-duckdb-readonly
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
python runtests.py

License

MIT. See LICENSE.

About

Read-only Django database backend for DuckDB. Query Parquet on S3/GCS, local .duckdb files, or in-memory through the Django ORM.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages