Commit caea929

docs: add Jupyter notebook support documentation (apache#1399)
1 parent bbf9d30 commit caea929

1 file changed

Lines changed: 99 additions & 0 deletions

docs/source/user-guide/python.md

@@ -141,6 +141,105 @@ assert result.column(0) == pyarrow.array([5, 7, 9])
assert result.column(1) == pyarrow.array([-3, -3, -3])
```

## Jupyter Notebook Support

Ballista works well in Jupyter notebooks. DataFrames automatically render as formatted HTML tables when displayed
in a notebook cell.

### Basic Usage

```python
from ballista import BallistaSessionContext

# Connect to a Ballista cluster
ctx = BallistaSessionContext("df://localhost:50050")

# Register a table
ctx.register_parquet("trips", "/path/to/nyctaxi.parquet")

# Run a query - the result renders as an HTML table
ctx.sql("SELECT * FROM trips LIMIT 10")
```
163+
164+
When a DataFrame is the last expression in a cell, Jupyter automatically calls its `_repr_html_()` method,
165+
which renders a styled table with:
166+
167+
- Formatted column headers
168+
- Expandable cells for long text content
169+
- Scrollable display for wide tables
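
The display hook that Jupyter relies on can be illustrated without a cluster. The sketch below uses a hypothetical `TinyTable` class (not part of Ballista) that implements `_repr_html_()`. In a notebook, leaving an instance as the last expression of a cell would render the returned HTML, which is the same mechanism described above for DataFrames. The markup Ballista actually produces will differ.

```python
from html import escape


class TinyTable:
    """Hypothetical stand-in for a DataFrame; not a Ballista class."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(row) for row in rows]

    def _repr_html_(self):
        # Jupyter calls this method, if present, to obtain an HTML rendering.
        header = "".join(f"<th>{escape(str(name))}</th>" for name in self.columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(value))}</td>" for value in row) + "</tr>"
            for row in self.rows
        )
        return f"<table><thead><tr>{header}</tr></thead><tbody>{body}</tbody></table>"


# Invented sample data; cell values are HTML-escaped before rendering.
table = TinyTable(["name", "total_spent"], [["Ada", 120.5], ["Bob <b>", 80]])
html = table._repr_html_()
```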
170+
171+
### Converting Results
172+
173+
DataFrames can be converted to various formats for further analysis:
174+
175+
```python
176+
df = ctx.sql("SELECT * FROM trips WHERE fare_amount > 50")
177+
178+
# Convert to Pandas DataFrame
179+
pandas_df = df.to_pandas()
180+
181+
# Convert to PyArrow Table
182+
arrow_table = df.to_arrow_table()
183+
184+
# Convert to Polars DataFrame
185+
polars_df = df.to_polars()
186+
187+
# Collect as PyArrow RecordBatches
188+
batches = df.collect()
189+
```

### Example Notebook Workflow

A typical notebook workflow might look like:

```python
# Cell 1: Setup
from ballista import BallistaSessionContext
from datafusion import col, lit

ctx = BallistaSessionContext("df://localhost:50050")
ctx.register_parquet("orders", "/data/orders.parquet")
ctx.register_parquet("customers", "/data/customers.parquet")

# Cell 2: Explore the data
ctx.sql("SELECT * FROM orders LIMIT 5")

# Cell 3: Run analysis
df = ctx.sql("""
    SELECT
        c.name,
        COUNT(*) as order_count,
        SUM(o.amount) as total_spent
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total_spent DESC
    LIMIT 10
""")
df

# Cell 4: Convert to Pandas for visualization
import matplotlib.pyplot as plt

pandas_df = df.to_pandas()
pandas_df.plot(kind='bar', x='name', y='total_spent')
plt.show()
```

### Running a Local Cluster in a Notebook

For development and testing, you can start a local cluster directly from a notebook:

```python
from ballista import BallistaSessionContext, setup_test_cluster

# Start a local scheduler and executor
host, port = setup_test_cluster()

# Connect to it
ctx = BallistaSessionContext(f"df://{host}:{port}")
```
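
If the scheduler needs a moment before it accepts connections (an assumption; `setup_test_cluster()` may already block until the cluster is ready), a notebook cell can poll the port before creating the context. The helper below, `wait_for_port`, is hypothetical and uses only the Python standard library; it is not part of Ballista.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.1):
    """Return True once a TCP connection to (host, port) succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing listening yet (or the attempt timed out); retry shortly.
            time.sleep(interval)
    return False
```

In a notebook, this could guard the connection step, for example by calling `wait_for_port(host, port)` with the values returned by `setup_test_cluster()` and only constructing `BallistaSessionContext` when it returns `True`.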

## User Defined Functions

The underlying DataFusion query engine supports Python UDFs but this functionality has not yet been implemented in
