Commit e63c82b

cigrainger and claude authored
docs: comprehensive distributed execution guide + docs uplift (#24)
* docs: comprehensive docs uplift for recent features

  New guide: Distributed IO — covers size-balanced reads, partition pruning,
  Postgres hash-partitioned reads, DuckLake file manifest, parallel writes,
  `partition_by` output, and distributed `insert_into`. Includes mermaid
  architecture diagrams and data flow charts.

  Updated guides:
  - data-io: add Excel IO, insert_into, database writes
  - joins-and-reshape: add ASOF JOIN section, cross-source join tip
  - cheatsheet: add ~15 missing functions (attach/detach, from_excel, to_excel,
    insert_into, partition_by, ASOF join, secrets, from_attached, distributed
    Postgres reads, distributed writes)

  Updated mix.exs to include the new guide in ExDoc extras.

* docs: comprehensive distributed execution guide + docs uplift

  Replace the thin distributed-queries.livemd and the separate distributed-io.md
  with a single comprehensive distributed.md guide covering the full story:

  - Architecture (Coordinator → PipelineSplitter → Partitioner → Workers → Merger)
  - Query decomposition with sequence diagram
  - Pipeline splitting: worker-safe vs coordinator-only ops
  - Aggregate rewrites (AVG → SUM/COUNT, STDDEV → Welford, COUNT DISTINCT → HLL)
  - Streaming merger and lattice compatibility
  - Data partitioning: size-balanced, Hive pruning, Postgres hash reads
  - Source safety classification table
  - Joins: broadcast (< 256 MB) and shuffle (4-phase hash exchange)
  - Distributed writes: parallel files, partition_by, insert_into
  - Performance considerations and common pitfalls
  - Fault tolerance summary
  - Telemetry events table

  Also updates the data-io, joins-and-reshape, and cheatsheet guides with
  Excel IO, ASOF JOIN, insert_into, attach/from_attached, partition_by,
  and secrets.

* docs: fix 'connected by BEAM' → 'connected by the BEAM'

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
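The aggregate rewrites listed above (AVG → SUM/COUNT, STDDEV → Welford) all follow one pattern: each worker produces a small partial state, and the coordinator merges states associatively. A language-neutral sketch of the STDDEV case in Python — function names and the fold loop are illustrative, not Dux's actual API:

```python
import math

def partial_state(chunk):
    """Per-worker pass: Welford running (count, mean, M2)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in chunk:
        n += 1
        d = x - mean
        mean += d / n
        m2 += d * (x - mean)
    return n, mean, m2

def merge(a, b):
    """Coordinator-side associative merge of two Welford states
    (Chan et al.'s parallel variance formula)."""
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    if n == 0:
        return 0, 0.0, 0.0
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

# Three "workers", one chunk each; the coordinator folds the partials.
chunks = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]
state = (0, 0.0, 0.0)
for chunk in chunks:
    state = merge(state, partial_state(chunk))

n, mean, m2 = state               # AVG falls out of the same states: mean = SUM/COUNT
stddev = math.sqrt(m2 / (n - 1))  # sample STDDEV from the merged M2
```

Because `merge` is associative, the coordinator can fold worker results in any arrival order — the property the streaming merger mentioned above depends on.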
1 parent 3fa3297 commit e63c82b

8 files changed

Lines changed: 574 additions & 162 deletions

guides/cheatsheet.cheatmd

Lines changed: 54 additions & 4 deletions
````diff
@@ -13,6 +13,23 @@ Dux.from_csv("data.csv")
 Dux.from_csv("data.csv", delimiter: "\t", nullstr: "NA")
 Dux.from_parquet("data/**/*.parquet")
 Dux.from_ndjson("events.ndjson")
+Dux.from_excel("data.xlsx")
+Dux.from_excel("data.xlsx", sheet: "Sales", all_varchar: true)
+```
+
+### From databases
+```elixir
+Dux.attach(:pg, "host=... dbname=db", type: :postgres)
+Dux.from_attached(:pg, "public.orders")
+Dux.from_attached(:pg, "public.orders", partition_by: :id)
+Dux.detach(:pg)
+Dux.list_attached()
+```
+
+### Secrets
+```elixir
+Dux.create_secret(:s3, type: :s3, key_id: "...", secret: "...", region: "us-east-1")
+Dux.drop_secret(:s3)
 ```
 
 ### From SQL
@@ -132,6 +149,12 @@ Dux.join(flights, airports, on: [{:dest, :faa}])
 Dux.join(orders, users, on: [{:customer_id, :id}])
 ```
 
+### ASOF join (time series)
+```elixir
+Dux.asof_join(trades, quotes, on: :symbol, by: {:timestamp, :>=})
+Dux.asof_join(trades, quotes, on: :symbol, by: {:timestamp, :>=}, how: :left)
+```
+
 ### Concat rows (UNION ALL)
 ```elixir
 Dux.concat_rows([df1, df2, df3])
@@ -176,7 +199,15 @@ Dux.from_query("SELECT * FROM 'file.csv'")
 Dux.to_csv(df, "out.csv")
 Dux.to_parquet(df, "out.parquet")
 Dux.to_parquet(df, "out.parquet", compression: :zstd)
+Dux.to_parquet(df, "out/", partition_by: [:year, :month])
 Dux.to_ndjson(df, "out.ndjson")
+Dux.to_excel(df, "out.xlsx")
+```
+
+### Database writes
+```elixir
+Dux.insert_into(df, "my_table", create: true)
+Dux.insert_into(df, "pg.public.events")
 ```
 
 ## Materialization
@@ -194,22 +225,41 @@ Dux.sql_preview(df, pretty: true) # → formatted SQL
 
 ## Distributed
 
+### Reads
 ```elixir
-# Discover or start workers
 workers = Dux.Remote.Worker.list()
 
-# Same verbs, automatically distributed
+# Size-balanced Parquet distribution
 Dux.from_parquet("s3://data/**/*.parquet")
 |> Dux.distribute(workers)
 |> Dux.filter(amount > 100)
 |> Dux.group_by(:region)
 |> Dux.summarise(total: sum(amount))
 |> Dux.to_rows()
 
-# Collect back to local %Dux{}
+# Hash-partitioned Postgres reads
+Dux.from_attached(:pg, "public.orders", partition_by: :id)
+|> Dux.distribute(workers)
+|> Dux.to_rows()
+```
+
+### Writes
+```elixir
+# Parallel file writes
+df |> Dux.distribute(workers) |> Dux.to_parquet("s3://out/")
+
+# Hive-partitioned output
+df |> Dux.distribute(workers) |> Dux.to_parquet("s3://out/", partition_by: :year)
+
+# Parallel database inserts
+df |> Dux.distribute(workers) |> Dux.insert_into("pg.public.events", create: true)
+
+# Collect back to local
 df |> Dux.distribute(workers) |> Dux.collect()
+```
 
-# FLAME: elastic cloud compute
+### FLAME: elastic cloud compute
+```elixir
 Dux.Flame.start_pool(backend: {FLAME.FlyBackend, ...}, max: 10)
 workers = Dux.Flame.spin_up(5)
 ```
````
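The "size-balanced Parquet distribution" line in the cheatsheet refers to spreading input files across workers by byte size rather than by count. A minimal sketch of one common approach — greedy longest-processing-time bin-packing; Python here for illustration, and this is not Dux's actual planner:

```python
import heapq

def assign_files(files, n_workers):
    """Greedy LPT: place the largest remaining file on the
    worker with the fewest assigned bytes so far."""
    heap = [(0, w) for w in range(n_workers)]  # (assigned bytes, worker id)
    heapq.heapify(heap)
    plan = {w: [] for w in range(n_workers)}
    for name, size in sorted(files, key=lambda f: -f[1]):
        total, w = heapq.heappop(heap)
        plan[w].append(name)
        heapq.heappush(heap, (total + size, w))
    return plan

files = [("a.parquet", 900), ("b.parquet", 500),
         ("c.parquet", 400), ("d.parquet", 100)]
plan = assign_files(files, 2)
# The two workers end up nearly balanced: 1000 vs 900 bytes,
# where naive round-robin by file count could give 1300 vs 600.
```

Balancing by bytes matters because Parquet scan time tracks file size far more closely than file count.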

guides/data-io.livemd

Lines changed: 54 additions & 0 deletions
````diff
@@ -97,6 +97,60 @@ Dux.Datasets.penguins()
 "#{div(File.stat!(parquet_path).size, 1024)} KB"
 ```
 
+## Reading Excel
+
+DuckDB 1.5+ reads `.xlsx` files natively. Dux defaults to `ignore_errors: true`
+and `empty_as_varchar: true` for safe handling of messy spreadsheets:
+
+```elixir
+# xlsx_path = "sales.xlsx"
+# Dux.from_excel(xlsx_path) |> Dux.to_rows()
+#
+# # With options
+# Dux.from_excel("data.xlsx", sheet: "Q1 2024", range: "A1:F100")
+#
+# # For messy spreadsheets with mixed types
+# Dux.from_excel("messy.xlsx", all_varchar: true)
+```
+
+## Writing Excel
+
+```elixir
+# excel_out = Path.join(tmp_dir, "output.xlsx")
+# Dux.Datasets.penguins()
+# |> Dux.filter_with("species = 'Gentoo'")
+# |> Dux.to_excel(excel_out)
+```
+
+## Database Tables: `insert_into`
+
+Write pipeline results to a table — local DuckDB or an attached database:
+
+```elixir
+# Create a local table from a pipeline
+Dux.from_query("SELECT * FROM range(100) t(x)")
+|> Dux.insert_into("my_table", create: true)
+
+# Read it back
+Dux.from_query("SELECT * FROM my_table") |> Dux.n_rows()
+```
+
+```elixir
+# Cleanup
+conn = Dux.Connection.get_conn()
+Adbc.Connection.query(conn, "DROP TABLE IF EXISTS my_table")
+```
+
+> #### Attached databases {: .info}
+>
+> `insert_into` works with attached databases too:
+> ```elixir
+> Dux.attach(:pg, "host=... dbname=analytics", type: :postgres, read_only: false)
+> Dux.from_parquet("data.parquet")
+> |> Dux.insert_into("pg.public.events", create: true)
+> ```
+> See the [Distributed Execution](distributed.md) guide for parallel writes.
+
 ## The SQL Escape Hatch
 
 `from_query/1` lets you write raw DuckDB SQL for anything the verbs don't cover:
````
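The `partition_by: :id` option on `from_attached` (shown in the cheatsheet diff) shards a Postgres table read by hashing a key column, so each worker pulls a disjoint slice. A small Python model of the idea — `stable_hash` and `hash_partition` are illustrative stand-ins, not Dux or Postgres internals:

```python
import zlib

def stable_hash(value):
    """Deterministic hash stand-in (Postgres applies its own hash operator)."""
    return zlib.crc32(str(value).encode())

def hash_partition(rows, key, n_workers):
    """Route each row to exactly one worker by hashing its key column —
    the same effect as each worker reading
    'WHERE hash(key) % n_workers = worker_index'."""
    parts = [[] for _ in range(n_workers)]
    for row in rows:
        parts[stable_hash(row[key]) % n_workers].append(row)
    return parts

rows = [{"id": i, "amount": i * 10} for i in range(100)]
parts = hash_partition(rows, "id", 4)
# Partitions are disjoint and together cover every row exactly once.
```

Because the slices are disjoint by construction, workers can scan in parallel without coordination and without returning duplicate rows.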

guides/distributed-queries.livemd

Lines changed: 0 additions & 153 deletions
This file was deleted.
