
Commit 5e6a32b

cigrainger and claude authored
chore: prep for v0.2.0 (#25)
* chore: prep for v0.2.0 release

  Bump version to 0.2.0. Rewrite README with:
  - CI, HexDocs, Hex.pm badges
  - dplyr design lineage and philosophy section
  - Concise feature overview (IO, distributed, cross-source, graph)
  - Distributed execution as a first-class section with code examples
  - Links to all HexDocs guides

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update all guide versions to ~> 0.2.0

  Replace github: "elixir-dux/dux" with {:dux, "~> 0.2.0"} in all livemd Mix.install calls. Incorporate user edits to README (Explorer lineage, C++ mention removal).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update changelog

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e63c82b commit 5e6a32b

8 files changed: 113 additions & 104 deletions

CHANGELOG.md (22 additions, 0 deletions)

```diff
@@ -4,9 +4,14 @@ All notable changes to Dux are documented here.
 
 ## [Unreleased]
 
+### Bug Fixes
+
+- Nx zero-copy types, connected components convergence, ADBC v0.10.0 (#14)
+
 ### Documentation
 
 - Comprehensive documentation uplift (#3)
+- Comprehensive distributed execution guide + docs uplift (#24)
 
 ### Features
 
@@ -20,10 +25,27 @@ All notable changes to Dux are documented here.
 - Streaming merger for lattice-compatible distributed aggregation (#8)
 - ASOF JOIN + JSON processing verbs (#9)
 - Cross-source joins via ATTACH (#10)
+- Update ADBC + zero-copy IPC ingest and Nx tensor path (#12)
+- Insert_into verb + distributed DuckLake reads (#13)
+- Size-balanced file assignment in Partitioner (#15)
+- Hive partition pruning in distributed reads (#18)
+- Distributed writes — workers write directly to storage (#19)
+- Distributed Postgres reads via hash-partitioned ATTACH (#20)
+- Partition_by option for Hive-partitioned Parquet output (#21)
+- Distributed insert_into — workers INSERT in parallel (#22)
+- Excel IO — from_excel/2 and to_excel/2 (#23)
 
 ### Miscellaneous
 
 - Bump to dev version
+- Docs touchups (#11)
+- Expand CI matrix to full OTP × Elixir lattice (#17)
+- Prep for v0.2.0 release
+- Update all guide versions to ~> 0.2.0
+
+### Testing
+
+- Add peer tests for size-balanced partitioner (#16)
 
 ## [0.1.1] - 2026-03-22
 
```
README.md (85 additions, 98 deletions)

````diff
@@ -1,179 +1,166 @@
 # Dux
 
-**DuckDB-native DataFrames for Elixir.**
+[![CI](https://github.com/elixir-dux/dux/actions/workflows/ci.yml/badge.svg)](https://github.com/elixir-dux/dux/actions/workflows/ci.yml)
+[![Docs](https://img.shields.io/badge/hex.pm-docs-green.svg?style=flat)](https://hexdocs.pm/dux)
+[![Hex.pm](https://img.shields.io/hexpm/v/dux.svg)](https://hex.pm/packages/dux)
 
-Dux is a dataframe library where DuckDB is the execution engine and the BEAM is the distributed runtime. Pipelines are lazy, operations compile to SQL CTEs, and DuckDB handles all the heavy lifting.
+**DuckDB-native dataframes for Elixir.**
+
+Dux gives you a [dplyr](https://dplyr.tidyverse.org)-style verb API backed by DuckDB's analytical engine, with built-in distributed execution across the BEAM. Pipelines are lazy, operations compile to SQL, and DuckDB handles columnar execution, vectorised aggregation, and predicate pushdown.
 
 ```elixir
 require Dux
 
 Dux.from_parquet("s3://data/sales/**/*.parquet")
-|> Dux.filter(amount > 100 and region == ^selected_region)
+|> Dux.filter(amount > 100 and region == "US")
 |> Dux.mutate(revenue: price * quantity)
 |> Dux.group_by(:product)
 |> Dux.summarise(total: sum(revenue), orders: count(product))
 |> Dux.sort_by(desc: :total)
-|> Dux.to_parquet("results.parquet", compression: :zstd)
+|> Dux.to_rows()
 ```
 
-## Why Dux?
+## Design
+
+Dux is the successor to [Explorer](https://github.com/elixir-explorer/explorer). That means it borrows its verb design from dplyr and the tidyverse — constrained, composable operations that each do one thing well. If you've used `dplyr::filter()`, `mutate()`, `group_by() |> summarise()`, the Dux API will feel familiar.
+
+Where Dux diverges from Explorer:
 
-- **The module IS the dataframe.** `Dux.filter(df, ...)` — no `Dux.DataFrame`, no `Dux.Series`. Just verbs that pipe.
-- **Everything is lazy.** Operations accumulate until `compute/1`. DuckDB optimizes the full pipeline.
-- **DuckDB-only.** No pluggable backends, no abstraction tax. Full access to DuckDB extensions, window functions, recursive CTEs.
-- **Elixir expressions compile to SQL.** `Dux.filter(df, x > ^min_val)` becomes `WHERE x > $1` with parameter bindings. SQL injection safe by construction.
-- **Distributed.** Ship `%Dux{}` structs to any BEAM node, compile to SQL there, execute against that node's local DuckDB. Fan out with the Coordinator, merge results.
-- **Graph analytics.** `Dux.Graph` — a graph is two dataframes (vertices + edges). PageRank, shortest paths, connected components as verb compositions.
-- **Nx interop.** Numeric columns become tensors via `Nx.LazyContainer`. Zero-copy where possible.
+- **The module IS the dataframe.** `Dux.filter(df, ...)` not `Dux.DataFrame.filter(df, ...)`. No Series API — all operations are dataframe-level.
+- **DuckDB is the only engine.** No pluggable backends, no abstraction tax. Full access to DuckDB's SQL functions, window functions, recursive CTEs, and 50+ extensions.
+- **Lazy by default.** Operations accumulate as an AST in `%Dux{}`. When you materialise (`compute/1`, `to_rows/1`), the whole pipeline compiles to a chain of SQL CTEs and DuckDB optimises end-to-end.
+- **Distributed on the BEAM.** `%Dux{}` is plain data — ship it to any BEAM node, compile to SQL there, execute against that node's local DuckDB. No function serialisation, no cluster manager, no heavyweight RPC.
 
 ## Installation
 
 ```elixir
 def deps do
-  [
-    {:dux, github: "elixir-dux/dux"}
-  ]
+  [{:dux, "~> 0.2.0"}]
 end
 ```
 
-Dux is a pure Elixir project. The DuckDB engine is provided via ADBC — a precompiled driver downloaded automatically at compile time. No Rust, C++, or DuckDB compilation needed.
+Dux is a pure Elixir project. The DuckDB engine is provided via [ADBC](https://github.com/elixir-explorer/adbc) — a precompiled driver downloaded automatically at compile time. No Rust or C++ compilation needed.
 
-## Quick start
+## Getting Started
 
 ```elixir
 require Dux
 
 # Built-in datasets — no files needed
 Dux.Datasets.flights()
 |> Dux.filter(distance > 1000)
-|> Dux.mutate(delay_per_mile: arr_delay / distance)
 |> Dux.group_by(:origin)
 |> Dux.summarise(avg_delay: avg(arr_delay), n: count(flight))
 |> Dux.sort_by(desc: :avg_delay)
+|> Dux.head(5)
 |> Dux.to_rows()
+```
+
+Every verb (`filter`, `mutate`, `group_by`, `summarise`, etc.) takes Elixir expressions via macros. Bare identifiers become column names. `^` interpolates Elixir values safely as parameter bindings:
 
-# [%{"origin" => "EWR", "avg_delay" => 8.15, "n" => 1094}, ...]
+```elixir
+min_amount = 500
+Dux.filter(df, amount > ^min_amount and status == "active")
 ```
 
-## Verbs
-
-All operations are verbs on `%Dux{}` structs:
-
-| Verb             | Description                                                |
-| ---------------- | ---------------------------------------------------------- |
-| `filter/2`       | Filter rows (macro: `filter(df, x > 10)`)                  |
-| `mutate/2`       | Add/replace columns (macro: `mutate(df, y: x * 2)`)        |
-| `select/2`       | Keep columns                                               |
-| `discard/2`      | Drop columns                                               |
-| `sort_by/2`      | Sort rows (asc/desc)                                       |
-| `group_by/2`     | Group for aggregation                                      |
-| `summarise/2`    | Aggregate (macro: `summarise(df, total: sum(x))`)          |
-| `join/3`         | Inner, left, right, cross, anti, semi joins                |
-| `head/2`         | First N rows                                               |
-| `slice/3`        | Offset + limit                                             |
-| `distinct/1`     | Deduplicate                                                |
-| `drop_nil/2`     | Remove rows with nil values                                |
-| `rename/2`       | Rename columns                                             |
-| `pivot_wider/4`  | Long → wide (DuckDB PIVOT)                                 |
-| `pivot_longer/3` | Wide → long (DuckDB UNPIVOT)                               |
-| `concat_rows/1`  | UNION ALL                                                  |
-| `compute/1`      | Execute the pipeline                                       |
-| `to_rows/1`      | Execute and return list of maps (`atom_keys: true` option) |
-| `to_columns/1`   | Execute and return column map                              |
-| `peek/2`         | Print formatted table preview                              |
-| `n_rows/1`       | Count rows                                                 |
-| `sql_preview/2`  | Show generated SQL (`pretty: true` option)                 |
-
-The `_with` variants (`filter_with/2`, `mutate_with/2`, `summarise_with/2`) accept raw SQL strings for programmatic use.
+The `_with` variants accept raw DuckDB SQL for anything the macro doesn't cover:
+
+```elixir
+Dux.mutate_with(df, rank: "ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC)")
+```
 
 ## IO
 
-DuckDB handles all file formats and remote access natively:
+Read and write CSV, Parquet, NDJSON, Excel, and database tables:
 
 ```elixir
-# Read
-Dux.from_csv("data.csv", delimiter: "\t")
-Dux.from_parquet("data/**/*.parquet")
-Dux.from_ndjson("events.ndjson")
-Dux.from_query("SELECT * FROM read_parquet('s3://bucket/data.parquet')")
-
-# Write
-Dux.to_csv(df, "output.csv")
-Dux.to_parquet(df, "output.parquet", compression: :zstd)
-Dux.to_ndjson(df, "output.ndjson")
+df = Dux.from_parquet("s3://bucket/data/**/*.parquet")
+df = Dux.from_csv("data.csv", delimiter: "\t")
+df = Dux.from_excel("sales.xlsx", sheet: "Q1")
+
+Dux.to_parquet(df, "output/", partition_by: [:year, :month])
+Dux.to_excel(df, "report.xlsx")
+Dux.insert_into(df, "pg.public.events", create: true)
 ```
 
-S3, HTTP, Postgres, MySQL, SQLite — all via DuckDB extensions. No separate libraries needed.
+Cross-source queries via DuckDB's ATTACH — Postgres, MySQL, SQLite, Iceberg, Delta, DuckLake:
+
+```elixir
+Dux.attach(:warehouse, "host=db.internal dbname=analytics", type: :postgres)
+customers = Dux.from_attached(:warehouse, "public.customers")
+
+Dux.from_parquet("s3://lake/orders/*.parquet")
+|> Dux.join(customers, on: :customer_id)
+|> Dux.group_by(:region)
+|> Dux.summarise(revenue: sum(amount))
+|> Dux.to_rows()
+```
 
-## Distributed queries
+## Distributed Execution
 
-Dux distributes analytical workloads across a BEAM cluster:
+Mark a pipeline for distributed execution with `distribute/2`. The same verbs work — Dux handles partitioning, fan-out, and merge automatically:
 
 ```elixir
-# Workers auto-register via :pg
 workers = Dux.Remote.Worker.list()
 
-# Mark for distributed, then use the same verbs
-Dux.from_parquet("data/**/*.parquet")
+Dux.from_parquet("s3://lake/events/**/*.parquet")
 |> Dux.distribute(workers)
-|> Dux.filter(amount > 100)
+|> Dux.filter(year == 2024)
 |> Dux.group_by(:region)
-|> Dux.summarise(total: sum(amount))
+|> Dux.summarise(total: sum(revenue))
 |> Dux.to_rows()
 ```
 
-No function serialization — `%Dux{}` is plain data. Ship it anywhere, compile to SQL there. No cluster manager — just `libcluster` + `:pg`. No heavyweight RPC — just `:erpc.multicall`.
+Under the hood: the Coordinator partitions files across workers (size-balanced, with Hive partition pruning), each worker compiles and executes SQL against its local DuckDB, and the Merger re-aggregates results. Workers read from and write to storage directly — no data funnels through the coordinator.
 
-## Graph analytics
+Distributed writes work the same way:
 
 ```elixir
-graph = Dux.Graph.new(vertices: users, edges: follows)
-
-# All algorithms are verb compositions
-graph |> Dux.Graph.pagerank() |> Dux.sort_by(desc: :rank) |> Dux.head(10)
-graph |> Dux.Graph.shortest_paths(start_node)
-graph |> Dux.Graph.connected_components()
-graph |> Dux.Graph.triangle_count()
-
-# Distribute graph across workers
-graph |> Dux.Graph.distribute(workers) |> Dux.Graph.pagerank()
+Dux.from_parquet("s3://input/**/*.parquet")
+|> Dux.distribute(workers)
+|> Dux.filter(status == "active")
+|> Dux.to_parquet("s3://output/", partition_by: :year)
 ```
 
-## Nx interop
-
-Numeric columns become tensors:
+Attach Postgres and distribute reads with `partition_by:`:
 
 ```elixir
-tensor = Dux.to_tensor(df, :price)
-# #Nx.Tensor<f64[1000] [...]>
+Dux.from_attached(:pg, "public.orders", partition_by: :id)
+|> Dux.distribute(workers)
+|> Dux.insert_into("pg.public.summary", create: true)
 ```
 
-`Dux` implements `Nx.LazyContainer` for use in `defn`.
+See the [Distributed Execution](https://hexdocs.pm/dux/distributed.html) guide for the full architecture — aggregate rewrites, broadcast vs shuffle joins, streaming merge, and fault tolerance.
 
-## Raw SQL escape hatch
+## Graph Analytics
 
-For anything the macro doesn't support — window functions, CASE WHEN, PIVOT, CTEs — use the `_with` variants with raw DuckDB SQL:
+A graph is two dataframes. All algorithms return `%Dux{}` — pipe into any verb:
 
 ```elixir
-# Window functions
-Dux.mutate_with(df, rank: "ROW_NUMBER() OVER (PARTITION BY \"dept\" ORDER BY \"salary\" DESC)")
+graph = Dux.Graph.new(vertices: users, edges: follows)
 
-# CASE WHEN
-Dux.mutate_with(df, tier: "CASE WHEN amount > 1000 THEN 'high' ELSE 'low' END")
+graph |> Dux.Graph.pagerank() |> Dux.sort_by(desc: :rank) |> Dux.head(10)
+graph |> Dux.Graph.shortest_paths(start_node)
+graph |> Dux.Graph.connected_components()
+```
 
-# Pivot
-Dux.from_query("PIVOT sales ON product USING SUM(amount) GROUP BY region")
+## Guides
 
-# Any DuckDB SQL
-Dux.from_query("SELECT * FROM read_parquet('s3://bucket/data.parquet') WHERE year = 2025")
-```
+- [Getting Started](https://hexdocs.pm/dux/getting-started.html) — core concepts, expressions, pipelines
+- [Data IO](https://hexdocs.pm/dux/data-io.html) — CSV, Parquet, Excel, NDJSON, database writes
+- [Transformations](https://hexdocs.pm/dux/transformations.html) — filter, mutate, window functions
+- [Joins & Reshape](https://hexdocs.pm/dux/joins-and-reshape.html) — join types, ASOF joins, pivots
+- [Distributed Execution](https://hexdocs.pm/dux/distributed.html) — architecture, partitioning, distributed IO
+- [Graph Analytics](https://hexdocs.pm/dux/graph-analytics.html) — PageRank, shortest paths, components
+- [Cheatsheet](https://hexdocs.pm/dux/cheatsheet.html) — quick reference for all verbs
 
 ## License
 
 Dual-licensed under Apache 2.0 and MIT. See [LICENSE-APACHE](LICENSE-APACHE) and [LICENSE-MIT](LICENSE-MIT).
 
 ## Links
 
-- [Documentation](https://hexdocs.pm/dux)
+- [HexDocs](https://hexdocs.pm/dux)
+- [Hex.pm](https://hex.pm/packages/dux)
 - [GitHub](https://github.com/elixir-dux/dux)
 - [Changelog](CHANGELOG.md)
````

guides/data-io.livemd (1 addition, 1 deletion)

````diff
@@ -1,7 +1,7 @@
 # Data IO
 
 ```elixir
-Mix.install([{:dux, github: "elixir-dux/dux"}])
+Mix.install([{:dux, "~> 0.2.0"}])
 ```
 
 ## Setup
````

guides/getting-started.livemd (1 addition, 1 deletion)

````diff
@@ -1,7 +1,7 @@
 # Getting Started
 
 ```elixir
-Mix.install([{:dux, github: "elixir-dux/dux"}])
+Mix.install([{:dux, "~> 0.2.0"}])
 ```
 
 ## Meet the Penguins
````

guides/graph-analytics.livemd (1 addition, 1 deletion)

````diff
@@ -1,7 +1,7 @@
 # Graph Analytics
 
 ```elixir
-Mix.install([{:dux, github: "elixir-dux/dux"}])
+Mix.install([{:dux, "~> 0.2.0"}])
 ```
 
 ## Graphs Are Two DataFrames
````

guides/joins-and-reshape.livemd (1 addition, 1 deletion)

````diff
@@ -1,7 +1,7 @@
 # Joins & Reshape
 
 ```elixir
-Mix.install([{:dux, github: "elixir-dux/dux"}])
+Mix.install([{:dux, "~> 0.2.0"}])
 ```
 
 ## NYC Flights: A Star Schema
````

guides/transformations.livemd (1 addition, 1 deletion)

````diff
@@ -1,7 +1,7 @@
 # Transformations
 
 ```elixir
-Mix.install([{:dux, github: "elixir-dux/dux"}])
+Mix.install([{:dux, "~> 0.2.0"}])
 ```
 
 ## Setup
````

mix.exs (1 addition, 1 deletion)

```diff
@@ -1,7 +1,7 @@
 defmodule Dux.MixProject do
   use Mix.Project
 
-  @version "0.1.2-dev"
+  @version "0.2.0"
   @source_url "https://github.com/elixir-dux/dux"
 
   def project do
```
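The guide and mix.exs changes all pin `~> 0.2.0`, Elixir's pessimistic version requirement: it accepts any 0.2.x patch release but rejects 0.3.0. A quick check with the standard-library `Version` module confirms the semantics:

```elixir
# "~> 0.2.0" permits patch updates within the 0.2 series only.
IO.inspect(Version.match?("0.2.5", "~> 0.2.0"))  # true
IO.inspect(Version.match?("0.3.0", "~> 0.2.0"))  # false
```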

0 commit comments
