21 changes: 0 additions & 21 deletions .cursor/rules/system-prompt.mdc

This file was deleted.

136 changes: 54 additions & 82 deletions README.md
@@ -1,80 +1,55 @@
# DataFusion Fuzzer

> **🚧 Work In Progress**
>
> This project is still under active development. The following documentation is AI-generated and requires future cleanup and validation.
>
> This is a Rust rewrite of [datafusion-sqlancer](https://github.com/apache/datafusion/issues/11030), originally implemented in Java. The rewrite aims to simplify implementation, enable better integration with existing DataFusion tooling, and make test oracles applicable to `sqllogictests`. See [this issue](https://github.com/apache/datafusion/issues/14535) for more details on the motivation behind the Rust rewrite.
A fuzzing tool for Apache DataFusion that tests SQL query execution and helps find potential bugs, crashes, and inconsistencies in query results.

A comprehensive fuzzing tool for Apache DataFusion, designed to test SQL query execution and find potential bugs, crashes, or inconsistencies in the query engine.
## Overview
This fuzzer primarily:
1. Generates random tables and SQL queries.
2. Runs them on DataFusion and checks whether the results satisfy an oracle-defined consistency rule.

## Quick Start

To run the fuzzer with default settings:

```bash
cargo run --release
```

To run with a custom configuration:
### Example
```text
Oracle: TLP (Ternary Logic Partitioning)

```bash
cargo run --release -- --config datafusion-fuzzer.toml
```
Random query (Q1):
SELECT * FROM t1;

To run with command-line options:
```bash
cargo run --release -- --config datafusion-fuzzer.toml --rounds 5 --queries-per-round 20
```
Mutated query (Q2):
SELECT * FROM t1 WHERE v1 > 0
UNION ALL
SELECT * FROM t1 WHERE NOT (v1 > 0)
UNION ALL
SELECT * FROM t1 WHERE (v1 > 0) IS NULL;

To run with verbose oracle/query logs to stdout:
```bash
RUST_LOG=info cargo run -- --config datafusion-fuzzer.toml --display-logs
Consistency check:
Q1 and Q2 should return the same multiset of rows.
```
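
The TLP example above can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not the fuzzer's actual code: the table is a single nullable column, and `Row`, `pred`, `tlp_partitions`, `partitions_without_null`, and `tlp_holds` are hypothetical names that do not come from this repository.

```rust
use std::collections::HashMap;

/// One row of a hypothetical table `t1(v1 INT)`; `None` models SQL NULL.
type Row = Option<i64>;

/// Three-valued evaluation of `v1 > 0`: `None` means UNKNOWN (NULL input).
fn pred(row: &Row) -> Option<bool> {
    row.map(|v| v > 0)
}

/// Count occurrences of each row so results can be compared as multisets
/// (row order is ignored, duplicates are not).
fn multiset(rows: &[Row]) -> HashMap<Row, usize> {
    let mut counts = HashMap::new();
    for r in rows {
        *counts.entry(*r).or_insert(0) += 1;
    }
    counts
}

/// Models Q2: `WHERE p` UNION ALL `WHERE NOT p` UNION ALL `WHERE p IS NULL`.
fn tlp_partitions(table: &[Row]) -> Vec<Row> {
    let p_true = table.iter().filter(|r| pred(r) == Some(true));
    let p_false = table.iter().filter(|r| pred(r) == Some(false));
    let p_null = table.iter().filter(|r| pred(r).is_none());
    p_true.chain(p_false).chain(p_null).copied().collect()
}

/// A deliberately incomplete rewrite that omits the `p IS NULL` partition.
fn partitions_without_null(table: &[Row]) -> Vec<Row> {
    table.iter().filter(|r| pred(r).is_some()).copied().collect()
}

/// The consistency check: Q1 (the full table) and Q2 must match as multisets.
fn tlp_holds(table: &[Row]) -> bool {
    multiset(table) == multiset(&tlp_partitions(table))
}
```

Calling `tlp_holds` on a table that contains a NULL row returns `true`, while comparing the same table against `partitions_without_null` shows a mismatch. That is why the third `IS NULL` partition is part of the rewrite.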

## Oracles

The runner currently chooses one oracle at random for each test case:

- `NoCrashOracle`: checks for non-whitelisted crashes/errors.
- `TlpWhereOracle`: validates TLP partitioning over `WHERE` (`p`, `NOT p`, `p IS NULL`) via value-level multiset comparison.
- `TlpHavingOracle`: validates TLP partitioning over `HAVING` (`p`, `NOT p`, `p IS NULL`) via value-level multiset comparison.

## Configuration
This project is inspired by [SQLancer](https://github.com/sqlancer/sqlancer).

The fuzzer supports extensive configuration options to customize the fuzzing process.
For an introduction to database fuzzing techniques, see this talk by the author of SQLancer: https://youtu.be/Np46NQ6lqP8?si=lSVAU7Jy3H-QtrWV

You can configure DataFusion Fuzzer in two ways:
## Quick Start

1. **Configuration file**: Use a TOML file to specify detailed settings
2. **Command-line arguments**: Override configuration file settings or use standalone
To run the fuzzer with the default sample configuration:

### Configuration File
```bash
cargo run --release -- --config fuzzer-default.toml
```

See `datafusion-fuzzer.toml` for an example configuration file:
This runs the fuzzer against the DataFusion version specified in `Cargo.toml`.

```toml
# Fuzzing execution settings
seed = 42
rounds = 3
queries_per_round = 10
timeout_seconds = 2
The config file controls options such as round count, timeout, and log directory.

# Logging settings
display_logs = false
enable_tui = true
log_path = "logs"
sample_interval_secs = 5
If a bug is found, use the CLI output and generated log files to reproduce it.

# Table generation parameters
max_column_count = 5
max_row_count = 100
max_expr_level = 3
max_group_by_count = 3
max_table_count = 3
max_insert_per_table = 20
To override configuration-file values with CLI arguments:
```bash
cargo run --release -- --config fuzzer-default.toml --rounds 5 --queries-per-round 20
```

See `fuzzer-default.toml` for supported options.
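
The override behavior described above (CLI arguments take precedence over file values) could look roughly like the following. This is a sketch under assumptions: `FuzzerConfig`, `CliOverrides`, and `apply_overrides` are hypothetical names, and only the `rounds` and `queries_per_round` options are modeled; the real implementation may differ.

```rust
/// Values loaded from the config file (hypothetical struct, two fields only).
#[derive(Debug, Clone, PartialEq)]
struct FuzzerConfig {
    rounds: u32,
    queries_per_round: u32,
}

/// Values parsed from the command line; `None` means the flag was not given.
#[derive(Debug, Default)]
struct CliOverrides {
    rounds: Option<u32>,
    queries_per_round: Option<u32>,
}

/// Start from the file values and replace each one that has a CLI override.
fn apply_overrides(file: &FuzzerConfig, cli: &CliOverrides) -> FuzzerConfig {
    FuzzerConfig {
        rounds: cli.rounds.unwrap_or(file.rounds),
        queries_per_round: cli.queries_per_round.unwrap_or(file.queries_per_round),
    }
}
```

With file values `rounds = 3` and `queries_per_round = 10`, passing only `--rounds 5` would yield `rounds = 5` and leave `queries_per_round` at 10 in this model.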

### Command Line Options

```
@@ -91,27 +66,28 @@ Options:
-V, --version Print version
```

### Configuration Parameters
## Roadmap

### Implemented Oracles
The runner currently chooses one oracle at random for each test case:

- `max_table_count`: Maximum number of tables that can be selected in a single query (default: 3)
- `max_column_count`: Maximum number of columns per generated table (default: 5)
- `max_row_count`: Maximum number of rows per generated table (default: 100)
- `max_expr_level`: Maximum expression nesting level (default: 3)
- `max_group_by_count`: Maximum number of `GROUP BY` expressions (default: 3)
- [x] `NoCrashOracle`: checks for non-whitelisted crashes and errors.
- [x] `TlpWhereOracle`: validates TLP partitioning over `WHERE` (`p`, `NOT p`, `p IS NULL`) using value-level multiset comparison.
- [x] `TlpHavingOracle`: validates TLP partitioning over `HAVING` (`p`, `NOT p`, `p IS NULL`) using value-level multiset comparison.
- [ ] `NoREC` (planned): [paper](https://www.manuelrigger.at/preprints/NoREC.pdf)
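
As a rough illustration of what "non-whitelisted crashes and errors" can mean for `NoCrashOracle`, the sketch below classifies a query outcome against a list of expected error messages. Everything here is an assumption made for illustration: `Verdict`, `EXPECTED_ERRORS`, and `classify` are hypothetical names, the whitelist entries are made up, and matching by lowercase substring is just one possible policy.

```rust
/// Outcome of the no-crash check for one query (hypothetical type).
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    ExpectedError,
    Bug,
}

/// Made-up whitelist: lowercase substrings of error messages that are expected.
const EXPECTED_ERRORS: &[&str] = &["divide by zero", "overflow"];

/// `Ok(n)` stands for a query that returned `n` rows; `Err(msg)` for a failure.
fn classify(result: &Result<usize, String>) -> Verdict {
    match result {
        Ok(_) => Verdict::Pass,
        Err(msg) => {
            let msg = msg.to_lowercase();
            if EXPECTED_ERRORS.iter().any(|pat| msg.contains(*pat)) {
                Verdict::ExpectedError
            } else {
                // Anything not on the whitelist is reported as a potential bug.
                Verdict::Bug
            }
        }
    }
}
```

A whitelist keeps the oracle from flagging errors that are legitimate for randomly generated queries, so only unexpected failures surface as findings.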

## Progress Tracker
### SQL Features
- [x] where
- [ ] sort + limit, offset
- [ ] aggregate
- [x] having
- [ ] join
- [ ] union/union all/intersect/except

### SQL - Subqueries
- [ ] views
- [ ] scalar subquery
- [ ] 'relation-like' subquery
- [x] WHERE
- [ ] SORT + LIMIT/OFFSET
- [ ] AGGREGATE
- [x] HAVING
- [ ] JOIN
- [ ] UNION/UNION ALL/INTERSECT/EXCEPT

### SQL Subqueries
- [ ] Views
- [ ] Scalar subquery
- [ ] `Relation-like` subquery

### Expressions
- [ ] Operators
@@ -120,15 +96,11 @@ Options:
- [ ] Window Functions

### Types
- [ ] Complete Primitive types
- [ ] Complete primitive type coverage
- [ ] Time-related types
- [ ] Array types
- [ ] Struct/Json
- [ ] Struct/JSON

### Infrastructure
- [x] CLI
- [x] Oracle interface

## License

[MIT](LICENSE)
File renamed without changes.