# Dux

[CI](https://github.com/elixir-dux/dux/actions/workflows/ci.yml)
[Docs](https://hexdocs.pm/dux)
[Hex](https://hex.pm/packages/dux)

**DuckDB-native dataframes for Elixir.**

Dux gives you a [dplyr](https://dplyr.tidyverse.org)-style verb API backed by DuckDB's analytical engine, with built-in distributed execution across the BEAM. Pipelines are lazy, operations compile to SQL, and DuckDB handles columnar execution, vectorised aggregation, and predicate pushdown.

```elixir
require Dux

Dux.from_parquet("s3://data/sales/**/*.parquet")
|> Dux.filter(amount > 100 and region == "US")
|> Dux.mutate(revenue: price * quantity)
|> Dux.group_by(:product)
|> Dux.summarise(total: sum(revenue), orders: count(product))
|> Dux.sort_by(desc: :total)
|> Dux.to_rows()
```

## Design

Dux is the successor to [Explorer](https://github.com/elixir-explorer/explorer), and it inherits Explorer's verb design from dplyr and the tidyverse — constrained, composable operations that each do one thing well. If you've used `dplyr::filter()`, `mutate()`, or `group_by() |> summarise()`, the Dux API will feel familiar.

Where Dux diverges from Explorer:

- **The module IS the dataframe.** `Dux.filter(df, ...)`, not `Dux.DataFrame.filter(df, ...)`. No Series API — all operations are dataframe-level.
- **DuckDB is the only engine.** No pluggable backends, no abstraction tax. Full access to DuckDB's SQL functions, window functions, recursive CTEs, and 50+ extensions.
- **Lazy by default.** Operations accumulate as an AST in `%Dux{}`. When you materialise (`compute/1`, `to_rows/1`), the whole pipeline compiles to a chain of SQL CTEs and DuckDB optimises end-to-end.
- **Distributed on the BEAM.** `%Dux{}` is plain data — ship it to any BEAM node, compile it to SQL there, and execute against that node's local DuckDB. No function serialisation, no cluster manager, no heavyweight RPC.

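The lazy model is easy to see directly: verbs only extend a pipeline description, and `sql_preview/2` (documented with a `pretty: true` option) prints the SQL a pipeline will compile to. A minimal sketch — the exact SQL text depends on Dux's compiler:

```elixir
require Dux

# Nothing executes here — each verb just extends the pipeline AST in %Dux{}.
pending =
  Dux.from_csv("events.csv")
  |> Dux.filter(amount > 100)
  |> Dux.group_by(:region)
  |> Dux.summarise(total: sum(amount))

# Inspect the generated chain of CTEs without running anything.
Dux.sql_preview(pending, pretty: true)

# Only now does DuckDB execute the whole pipeline end-to-end.
Dux.to_rows(pending)
```
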
## Installation

```elixir
def deps do
  [{:dux, "~> 0.2.0"}]
end
```

Dux is a pure Elixir project. The DuckDB engine is provided via [ADBC](https://github.com/elixir-explorer/adbc) — a precompiled driver downloaded automatically at compile time. No Rust or C++ compilation needed.

## Getting Started

```elixir
require Dux

# Built-in datasets — no files needed
Dux.Datasets.flights()
|> Dux.filter(distance > 1000)
|> Dux.group_by(:origin)
|> Dux.summarise(avg_delay: avg(arr_delay), n: count(flight))
|> Dux.sort_by(desc: :avg_delay)
|> Dux.head(5)
|> Dux.to_rows()
```

Every verb (`filter`, `mutate`, `group_by`, `summarise`, etc.) takes Elixir expressions via macros. Bare identifiers become column names; `^` interpolates Elixir values safely as parameter bindings:

```elixir
min_amount = 500
Dux.filter(df, amount > ^min_amount and status == "active")
```

The `_with` variants (`filter_with/2`, `mutate_with/2`, `summarise_with/2`) accept raw DuckDB SQL for anything the macro doesn't cover:

```elixir
Dux.mutate_with(df, rank: "ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC)")
```

## IO

Read and write CSV, Parquet, NDJSON, Excel, and database tables:

```elixir
df = Dux.from_parquet("s3://bucket/data/**/*.parquet")
df = Dux.from_csv("data.csv", delimiter: "\t")
df = Dux.from_excel("sales.xlsx", sheet: "Q1")

Dux.to_parquet(df, "output/", partition_by: [:year, :month])
Dux.to_excel(df, "report.xlsx")
Dux.insert_into(df, "pg.public.events", create: true)
```

Cross-source queries via DuckDB's `ATTACH` — Postgres, MySQL, SQLite, Iceberg, Delta, DuckLake:

```elixir
Dux.attach(:warehouse, "host=db.internal dbname=analytics", type: :postgres)
customers = Dux.from_attached(:warehouse, "public.customers")

Dux.from_parquet("s3://lake/orders/*.parquet")
|> Dux.join(customers, on: :customer_id)
|> Dux.group_by(:region)
|> Dux.summarise(revenue: sum(amount))
|> Dux.to_rows()
```

## Distributed Execution

Mark a pipeline for distributed execution with `distribute/2`. The same verbs work — Dux handles partitioning, fan-out, and merge automatically:

```elixir
workers = Dux.Remote.Worker.list()

Dux.from_parquet("s3://lake/events/**/*.parquet")
|> Dux.distribute(workers)
|> Dux.filter(year == 2024)
|> Dux.group_by(:region)
|> Dux.summarise(total: sum(revenue))
|> Dux.to_rows()
```

Under the hood, the Coordinator partitions files across workers (size-balanced, with Hive partition pruning), each worker compiles and executes SQL against its local DuckDB, and the Merger re-aggregates the results. Workers read from and write to storage directly — no data funnels through the coordinator.

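The re-aggregation step can be pictured with `sum`. The SQL below is an illustrative sketch of the rewrite described above, not Dux's actual generated SQL:

```elixir
# Illustrative only — hypothetical SQL, not Dux's actual compiler output.
#
# Each worker computes a partial aggregate over its slice of the files:
#
#   SELECT region, SUM(revenue) AS total
#   FROM read_parquet(['part-0.parquet', 'part-1.parquet'])
#   GROUP BY region
#
# The Merger then combines the partial rows with a second aggregation —
# a SUM of SUMs. (Non-decomposable aggregates like AVG must first be
# rewritten into SUM/COUNT pairs, one of the "aggregate rewrites" covered
# in the Distributed Execution guide.)
#
#   SELECT region, SUM(total) AS total
#   FROM worker_results
#   GROUP BY region
```
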
Distributed writes work the same way:

```elixir
Dux.from_parquet("s3://input/**/*.parquet")
|> Dux.distribute(workers)
|> Dux.filter(status == "active")
|> Dux.to_parquet("s3://output/", partition_by: :year)
```

Attach Postgres and distribute reads with `partition_by:`:

```elixir
Dux.from_attached(:pg, "public.orders", partition_by: :id)
|> Dux.distribute(workers)
|> Dux.insert_into("pg.public.summary", create: true)
```

See the [Distributed Execution](https://hexdocs.pm/dux/distributed.html) guide for the full architecture — aggregate rewrites, broadcast vs shuffle joins, streaming merge, and fault tolerance.

## Graph Analytics

A graph is two dataframes — vertices and edges. All algorithms return `%Dux{}`, so you can pipe the result into any verb:

```elixir
graph = Dux.Graph.new(vertices: users, edges: follows)

graph |> Dux.Graph.pagerank() |> Dux.sort_by(desc: :rank) |> Dux.head(10)
graph |> Dux.Graph.shortest_paths(start_node)
graph |> Dux.Graph.connected_components()
```

## Guides

- [Getting Started](https://hexdocs.pm/dux/getting-started.html) — core concepts, expressions, pipelines
- [Data IO](https://hexdocs.pm/dux/data-io.html) — CSV, Parquet, Excel, NDJSON, database writes
- [Transformations](https://hexdocs.pm/dux/transformations.html) — filter, mutate, window functions
- [Joins & Reshape](https://hexdocs.pm/dux/joins-and-reshape.html) — join types, ASOF joins, pivots
- [Distributed Execution](https://hexdocs.pm/dux/distributed.html) — architecture, partitioning, distributed IO
- [Graph Analytics](https://hexdocs.pm/dux/graph-analytics.html) — PageRank, shortest paths, components
- [Cheatsheet](https://hexdocs.pm/dux/cheatsheet.html) — quick reference for all verbs

## License

Dual-licensed under Apache 2.0 and MIT. See [LICENSE-APACHE](LICENSE-APACHE) and [LICENSE-MIT](LICENSE-MIT).

## Links

- [HexDocs](https://hexdocs.pm/dux)
- [Hex.pm](https://hex.pm/packages/dux)
- [GitHub](https://github.com/elixir-dux/dux)
- [Changelog](CHANGELOG.md)