Library of SKILLs for writing rules #24812

Dev-iL · 2026-04-24T10:31:33Z

Dev-iL
Apr 24, 2026

I recently worked on several AIR rules. While doing so, I developed a SKILL for creating additional rules of this family, incorporating feedback and best practices. My question is: would this be something worthwhile to add to the Ruff repo itself (under .claude/skills/)?

I imagine other developers have similar SKILLs they use - in which case it would make a lot of sense to consolidate them in a centralized location. This has the potential of making everyone's lives easier since rule additions/amendments would presumably start off at a higher baseline that's aligned with the repo's best practices.

Regarding the maintenance of such SKILLs, I'd add a note to the contribution guidelines or even PR template, along the lines of "If a dedicated SKILL was used to create this rule, please consider asking your LLM to incorporate any PR feedback into the SKILL, so that insights can be reused by others".

ntBre · 2026-04-24T18:16:01Z

ntBre
Apr 24, 2026
Maintainer

Thanks for opening this discussion. It might be interesting to share the skill before deciding if we want to add it to the repo. I'm curious how it differs from the existing contribution guidelines for adding new rules and if we should just update those instead.

5 replies

Dev-iL Apr 24, 2026
Author

Sure thing, you can see it here: https://gist.github.com/Dev-iL/3ee77257fff578b884a0f7492d082093

MichaReiser Apr 24, 2026
Maintainer

I think we'd have to narrow that significantly to avoid it being something that needs continuous updating. It's also too specific in my view.

Dev-iL Apr 24, 2026
Author

Well, it was created specifically for airflow rules after all. I believe it can be split into general + airflow-specific parts quite easily. The bigger question is whether it's worth it.

MichaReiser Apr 24, 2026
Maintainer

The way to measure this is to give two AI agents the same task, where one has access to the skill and the other doesn't. Is the result of the one using the skill significantly better?

Dev-iL Apr 24, 2026
Author

I guess I'll have to try that when I implement my next rule. Perhaps @Lee-W @sjyangkevin or @dstandish used agents and can comment.

This will be a harder comparison to make today since there are more rules (i.e. examples) now than there were before I added my first rule - so it will have to be done on an earlier version of the repo.

For now, you can compare the earliest rule I submitted to the latest one, which used a more recent version of the skill, and see whether the initial version of the latter is more polished.

Dev-iL · 2026-05-13T09:26:02Z

Dev-iL
May 13, 2026
Author

I've done an experiment to gauge this.

Setup

- Claude Opus 4.7 on medium effort.
- Skill present vs absent.
- Start from a planning stage then proceed to implement using auto mode.

Prompt

Implement a new ruff rule for apache airflow called AIR202. This rule processes @task-decorated functions that meet at least one of the below:

- the decorated method returns a subclass of collections.abc.Mapping. We should look for return statements followed by dict definition patterns common in airflow, such as dict comprehension or an inline dict.

- the method has a return type hint which is a subclass of collections.abc.Mapping

The rule then flags these functions if the multiple_outputs parameter is not specified.

An unsafe autofix should always be available - adding either multiple_outputs=True or multiple_outputs=False kwarg to the @task decorator based on airflow's internal inference rules.

Context: The reason for this rule is to decouple the presence of type hints from XCom functionality. Applying the rule will have several benefits: 1) Make the intent of the Dag author explicit; 2) Robustify the code to future changes modifying the inference logic; 3) Increase awareness of multiple_outputs.

path to a local clone of airflow: @../airflow/

relevant docs:
@../airflow/airflow-core/docs/tutorial/taskflow.rst

multiple_outputs inference logic:
Lines 476-496 in @../airflow/task-sdk/src/airflow/sdk/bases/decorator.py

example:

@task
def a(): # multiple_outputs = False
    return {"x": 1}

@task
def b() -> dict: # multiple_outputs = True ← inferred
    return {"x": 1}

Plans

No-SKILL Plan

# AIR202 — Explicit `multiple_outputs` for `@task` returning a Mapping

## Context

Apache Airflow's `@task` decorator infers `multiple_outputs` from the return type annotation: if the return type is a `Mapping` subclass, return-value keys become individual XComs; otherwise the entire value is stored under a single XCom. This implicit coupling between *type annotations* and *runtime XCom semantics* surprises Dag authors and is fragile to future changes in the inference logic (see `airflow/task-sdk/src/airflow/sdk/bases/decorator.py:476-496`).

AIR202 flags `@task`-decorated functions that look like they return a mapping but do not explicitly pass `multiple_outputs=...`. The autofix adds `multiple_outputs=True/False` based on the same inference Airflow uses, preserving runtime behavior while making intent explicit.

The rule fires when **either** of these holds AND `multiple_outputs` is not already set on the decorator:

1. The function has a return type **annotation** whose name is in the Mapping family (see Detection below).
2. The function body contains a `return` statement whose value is a **dict literal** (`{"k": v, ...}`) or a **dict comprehension** (`{k: v for ...}`).

## Detection details

### Decorator targets

All `@task.*` forms (per user choice). Reuse `is_airflow_task()` from `crates/ruff_linter/src/rules/airflow/helpers.rs:307-328` — it already matches `@task`, `@task()`, and `@task.<variant>` for both `airflow.decorators.task` and `airflow.sdk.task`. The rule must also skip if any decorator already passes `multiple_outputs` as a keyword.

### Return-annotation match (Mapping family)

Resolve the annotation via `semantic.resolve_qualified_name`. If the annotation is a subscript (`dict[str, int]`), resolve the *value* (`dict`) part. Match any of:

- `["dict"]` (builtin)
- `["typing", "Dict"]`, `["typing_extensions", "Dict"]`
- `["typing", "Mapping"]`, `["typing_extensions", "Mapping"]`
- `["typing", "MutableMapping"]`, `["typing_extensions", "MutableMapping"]`
- `["collections", "abc", "Mapping"]`
- `["collections", "abc", "MutableMapping"]`
- `["collections", "OrderedDict"]`, `["typing", "OrderedDict"]`

### Return-value match

Walk the function body with `ReturnStatementVisitor` (used in `task_branch_as_short_circuit.rs`). A return statement matches when `ret.value` is `Expr::Dict(_)` or `Expr::DictComp(_)`. Do not descend into nested function/class scopes (the visitor already respects this).

### Autofix

Always available, **unsafe**. Determine the value Airflow *would* infer:

- If the function has a return annotation in the Mapping family → `multiple_outputs=True`.
- Otherwise (no annotation, or annotation isn't Mapping family) → `multiple_outputs=False`.

This mirrors `_infer_multiple_outputs` in `airflow/task-sdk/src/airflow/sdk/bases/decorator.py:476-496` and preserves runtime behavior across the fix.

Edit construction:

- `@task` (no call) → replace decorator expression with `@task(multiple_outputs=...)`. Use `Edit::range_replacement` over `decorator.expression.range()`.
- `@task()` (empty call) → insert `multiple_outputs=...` inside the parens. Use `Edit::insertion` at the position just after `(`.
- `@task(retries=3, ...)` → append `, multiple_outputs=...` before the closing `)`. Use `Edit::insertion` just before the call's `)` token (find via `call.arguments.range().end() - 1` or scan for `)`).
- Variant forms (`@task.virtualenv`, etc.) → same three cases, applied to the variant call.

Report range: the decorator that triggers the rule.

Diagnostic: `Fix::unsafe_edit(...)` with title "Add explicit `multiple_outputs=True`" or `"...=False"`.

## Files to create

- `crates/ruff_linter/src/rules/airflow/rules/task_explicit_multiple_outputs.rs` — the rule.
- `crates/ruff_linter/resources/test/fixtures/airflow/AIR202.py` — fixture covering:
  - `@task` returning dict literal (no annotation) → `multiple_outputs=False`
  - `@task` with `-> dict` annotation, body returns dict → `multiple_outputs=True`
  - `@task` with `-> dict[str, int]` annotation → `multiple_outputs=True`
  - `@task` with `-> Mapping[str, int]` (import from `typing`) → `multiple_outputs=True`
  - `@task` with `-> collections.abc.Mapping` → `multiple_outputs=True`
  - `@task()` form with empty parens
  - `@task(retries=3)` with existing kwargs
  - `@task.virtualenv(...)` variant
  - Dict comprehension return
  - **Negative**: `@task(multiple_outputs=True)` already specified → no diagnostic
  - **Negative**: `@task` returning list/int/str with no Mapping annotation → no diagnostic
  - **Negative**: function not decorated with `@task` → no diagnostic
  - **Negative**: dict literal inside nested function inside the task → no diagnostic
- The snapshot file is auto-generated by `cargo nextest run` with `INSTA_UPDATE=always`.

## Files to modify

- `crates/ruff_linter/src/rules/airflow/rules/mod.rs` — add `mod task_explicit_multiple_outputs;` and `pub(crate) use task_explicit_multiple_outputs::*;`.
- `crates/ruff_linter/src/codes.rs` (around line 1139, after the AIR201 entry) — add `(Airflow, "202") => rules::airflow::rules::AirflowTaskMissingMultipleOutputs,`.
- `crates/ruff_linter/src/checkers/ast/analyze/statement.rs` (around line 342 where `AirflowTaskBranchAsShortCircuit` is dispatched) — add dispatch on `StmtFunctionDef`:
  ```rust
  if checker.is_rule_enabled(Rule::AirflowTaskMissingMultipleOutputs) {
      airflow::rules::task_explicit_multiple_outputs(checker, function_def);
  }
  ```
- `crates/ruff_linter/src/rules/airflow/mod.rs` — add a `#[test_case(Rule::AirflowTaskMissingMultipleOutputs, Path::new("AIR202.py"))]` entry.

## Rule struct (sketch)

```rust
/// ## What it does
/// Checks for `@task`-decorated functions that appear to return a `Mapping`
/// (dict literal, dict comprehension, or an annotated return type in the
/// `Mapping` family) without explicitly specifying `multiple_outputs`.
///
/// ## Why is this bad?
/// Airflow infers `multiple_outputs` from the return type annotation: if the
/// annotation is a `Mapping` subclass, return-value keys are split into
/// individual XComs; otherwise the entire value is stored as one XCom. This
/// implicit coupling between type annotations and XCom semantics is
/// surprising and fragile. Passing `multiple_outputs` explicitly makes the
/// intent clear and decouples the behavior from the annotation.
///
/// ## Example
/// ```python
/// @task
/// def my_task() -> dict:
///     return {"x": 1}
/// ```
///
/// Use instead:
/// ```python
/// @task(multiple_outputs=True)
/// def my_task() -> dict:
///     return {"x": 1}
/// ```
///
/// ## Fix safety
/// The fix is always unsafe: it pins the value Airflow would have inferred
/// today, but if the function's behavior was relying on a different
/// inference (or future inference change) the resulting XCom shape may
/// change.
#[derive(ViolationMetadata)]
#[violation_metadata(preview_since = "0.15.13")]  // bump to current preview version
pub(crate) struct AirflowTaskMissingMultipleOutputs {
    inferred: bool,
}
```

`Violation` impl returns
`"`multiple_outputs` should be explicitly specified on `@task` returning a `Mapping`"`
and fix title `"Add `multiple_outputs={True|False}`"`.

## Reusable utilities

- `is_airflow_task` — `crates/ruff_linter/src/rules/airflow/helpers.rs:307`
- `ReturnStatementVisitor` — `ruff_python_ast::helpers::ReturnStatementVisitor` (used in `task_branch_as_short_circuit.rs:2`)
- `semantic.resolve_qualified_name` for annotation matching
- `Fix::unsafe_edit` / `Edit::range_replacement` / `Edit::insertion`

## Verification

```sh
# Build & focused test
CARGO_PROFILE_DEV_OPT_LEVEL=1 INSTA_FORCE_PASS=1 INSTA_UPDATE=always \
  CARGO_PROFILE_DEV_DEBUG="line-tables-only" MDTEST_UPDATE_SNAPSHOTS=1 \
  cargo nextest run -p ruff_linter -- airflow

# Review newly generated snapshot under
# crates/ruff_linter/src/rules/airflow/snapshots/ruff_linter__rules__airflow__tests__AIR202_AIR202.py.snap

# Hand-check on the airflow repo (will surface real-world cases)
cargo run --bin ruff -- check --select AIR202 --preview ../airflow

# Try the autofix on the fixture (a copy)
cp crates/ruff_linter/resources/test/fixtures/airflow/AIR202.py /tmp/AIR202_copy.py
cargo run --bin ruff -- check --select AIR202 --preview --fix --unsafe-fixes /tmp/AIR202_copy.py
diff crates/ruff_linter/resources/test/fixtures/airflow/AIR202.py /tmp/AIR202_copy.py

# Lints/format
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo dev generate-all   # regenerates schema + rule docs
uvx prek run -a
```

## Open / non-goals

- Does not attempt to follow type aliases (`MyDict = dict[str, int]`) — only matches annotation names directly.
- Does not detect dict-returning expressions other than dict literals and dict comprehensions (e.g., `return dict(...)`, `return some_var`). These can produce false negatives; deliberate per "common in airflow" framing.
- Does not check `multiple_outputs` passed via `.override(...)` chain — only the decorator itself.

SKILL-assisted Plan

# AIR202: Explicit `multiple_outputs` on `@task`-decorated functions

## Context

Airflow's `@task` decorator infers `multiple_outputs` from the return type annotation at runtime: if the annotation resolves to a subclass of `collections.abc.Mapping`, the function's return value is split into multiple XComs (one per key); otherwise it is stored as a single XCom. The inference happens in `_infer_multiple_outputs` at `task-sdk/src/airflow/sdk/bases/decorator.py:477-496`:

```python
ttype = getattr(return_type, "__origin__", return_type)
return isinstance(ttype, type) and issubclass(ttype, Mapping)
```

This couples *typing* to *runtime XCom behavior* in a non-obvious way. Renaming a type hint, removing it, or switching `dict` → `tuple` silently changes the DAG's XCom layout. AIR202 flags `@task`-decorated functions whose XCom behavior is determined by inference rather than by an explicit `multiple_outputs=...` kwarg, so authors:

1. Make their intent explicit.
2. Insulate their DAGs from future inference-logic changes.
3. Become aware that `multiple_outputs` exists.

## Rule summary

**Code:** `AIR202` (next free in the 2xx range — AIR201 is the last one used).
**Name:** `AirflowTaskMultipleOutputsImplicit` (reads "allow airflow task multiple outputs implicit").
**Category:** General best-practice (AIR0xx/2xx), targets both Airflow 2 and 3 (`airflow.decorators.task` and `airflow.sdk.task`).
**Fix availability:** Always available, **unsafe**.

### Trigger

A `FunctionDef` is flagged when **all** of the following hold:

1. It is decorated with `@task` or one of its supported variants (see "Decorator matching" below) — but *not* `@task.sensor`, which hardcodes `multiple_outputs=False`.
2. The decorator does **not** already pass `multiple_outputs` (in either `@task(multiple_outputs=...)` form or any of the variant subscripts).
3. At least one of:
   - **Annotation path:** The return type annotation resolves to a subclass of `collections.abc.Mapping` (e.g., `dict`, `dict[str, int]`, `typing.Dict`, `typing.Mapping`, `collections.abc.Mapping`, or a `TypedDict` subclass).
   - **Body path:** Some return statement in the function body returns an inline dict literal (`return {...}`), a dict comprehension (`return {k: v for ...}`), or a `dict(...)` call.

### Autofix (unsafe)

Insert a `multiple_outputs=` keyword argument into the decorator, mirroring Airflow's runtime inference:

- If the annotation path fires (Mapping-subclass annotation) → add `multiple_outputs=True`.
- Otherwise (body path only) → add `multiple_outputs=False` (matches what Airflow would have inferred at runtime; preserves current behavior).

Two decorator forms need to be handled:

- `@task(...)` (already a `Call`): use `crate::fix::edits::add_argument` (`crates/ruff_linter/src/fix/edits.rs:274`).
- `@task` / `@task.branch` etc. (bare `Name` or `Attribute`, no call): insert `(multiple_outputs=True/False)` at the end of the decorator expression's range via `Edit::insertion`.
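
For illustration only, here is a minimal Python sketch of the two decorator forms listed above. It is not how Ruff would do it: it rebuilds the decorator with the standard `ast` module instead of editing source text, so comments and original formatting are lost, which a real fix would need to preserve. `add_multiple_outputs` is a hypothetical name and is unrelated to Ruff's `add_argument` helper.

```python
import ast


def add_multiple_outputs(decorator_src: str, value: bool) -> str:
    """Return the decorator expression (without the leading '@') with an
    explicit multiple_outputs keyword. Hypothetical helper, illustration only."""
    expr = ast.parse(decorator_src, mode="eval").body
    kwarg = f"multiple_outputs={value}"

    # Bare form (`task`, `task.branch`): append a call with the keyword.
    if not isinstance(expr, ast.Call):
        return f"{decorator_src}({kwarg})"

    # Already explicit: nothing to do.
    if any(kw.arg == "multiple_outputs" for kw in expr.keywords):
        return decorator_src

    # Called form (`task()`, `task(retries=3)`): add the keyword last.
    # Assumption: re-serialising via ast.unparse is acceptable here; it
    # normalises formatting and drops comments.
    expr.keywords.append(ast.keyword(arg="multiple_outputs", value=ast.Constant(value)))
    return ast.unparse(expr)


for src, value in [("task", True), ("task.branch", False), ("task()", False), ("task(retries=3)", True)]:
    print(f"@{src}  ->  @{add_multiple_outputs(src, value)}")
```

The called form is where hand-rolled text edits tend to go wrong (trailing commas, multiline arguments, comments), which is the usual argument for reusing an existing edit helper.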

The fix is **always unsafe** because (a) the user may rely on inference behavior elsewhere, and (b) we cannot statically prove the dict-return pattern matches *all* return paths (e.g., conditional returns of non-dicts).

### Decorator matching

Use the existing `is_airflow_task_variant` helper in `crates/ruff_linter/src/rules/airflow/helpers.rs` to check variants. The rule should iterate the function's decorators and accept any decorator that matches one of:

- `airflow.decorators.task` / `airflow.sdk.task` (bare or called)
- `task.python`, `task.virtualenv`, `task.external_python`, `task.branch`, `task.branch_virtualenv`, `task.branch_external_python`, `task.short_circuit`, `task.docker`, `task.kubernetes`, `task.pyspark` (attribute access on `task`)

Explicitly exclude `task.sensor`.

If `is_airflow_task_variant` doesn't already cover the multi-variant case, extend `helpers.rs` with a small `is_airflow_task_supporting_multiple_outputs(decorator, semantic) -> Option<&Decorator>` helper that returns the matched decorator (so the fix knows where to insert). Place it near `is_airflow_task` for discoverability.

### Mapping-annotation detection

Helper logic (private to the rule file):

1. Unwrap subscripts: for `dict[str, int]` look at the `value` (`dict`).
2. Resolve qualified name via `checker.semantic().resolve_qualified_name(...)`.
3. Return `true` if segments match any of:
   - `["", "dict"]` (builtin — check via `semantic().match_builtin_expr(expr, "dict")`).
   - `["typing", "Dict"]`, `["typing", "Mapping"]`, `["typing", "MutableMapping"]`, `["typing", "OrderedDict"]`, `["typing", "DefaultDict"]`, `["typing", "Counter"]`, `["typing", "ChainMap"]`, `["typing", "TypedDict"]` and `typing_extensions` equivalents — use `semantic().match_typing_expr(...)`.
   - `["collections", "abc", "Mapping" | "MutableMapping"]`.
   - `["collections", "OrderedDict" | "defaultdict" | "Counter" | "ChainMap"]`.
4. For class names (e.g., a user-defined `class MyDict(TypedDict): ...`), best-effort detection only — if the annotation is a local class whose bases include a `TypedDict` parent reachable via the semantic model, treat it as Mapping. If detection is uncertain, *do not* flag (avoids false positives). The body-path check still catches the common case.
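
As a rough illustration of steps 1 to 3, the following Python sketch does the same kind of match on a Python `ast` annotation node. It is a simplification under stated assumptions: it compares the dotted name as written in the source and does not resolve imports or aliases the way a semantic model would, and `MAPPING_NAMES`, `dotted_name` and `is_mapping_annotation` are hypothetical names covering only a subset of the list above.

```python
import ast
from typing import Optional

# Assumption: a small subset of the Mapping family, spelled as fully dotted names.
MAPPING_NAMES = {
    "dict",
    "typing.Dict",
    "typing.Mapping",
    "typing.MutableMapping",
    "collections.abc.Mapping",
    "collections.abc.MutableMapping",
    "collections.OrderedDict",
}


def dotted_name(expr: ast.expr) -> Optional[str]:
    """Turn `a.b.c` into "a.b.c"; return None for anything else."""
    parts = []
    while isinstance(expr, ast.Attribute):
        parts.append(expr.attr)
        expr = expr.value
    if isinstance(expr, ast.Name):
        parts.append(expr.id)
        return ".".join(reversed(parts))
    return None


def is_mapping_annotation(annotation: Optional[ast.expr]) -> bool:
    if annotation is None:
        return False
    # Step 1: unwrap subscripts, e.g. dict[str, int] -> dict.
    if isinstance(annotation, ast.Subscript):
        annotation = annotation.value
    # Steps 2-3: compare the (unresolved) dotted name against the known set.
    return dotted_name(annotation) in MAPPING_NAMES


def annotation_of(source: str) -> Optional[ast.expr]:
    """Return the return annotation of the first function in `source`."""
    func = ast.parse(source).body[0]
    assert isinstance(func, ast.FunctionDef)
    return func.returns


print(is_mapping_annotation(annotation_of("def f() -> dict[str, int]: ...")))
print(is_mapping_annotation(annotation_of("def f() -> list[int]: ...")))
```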

### Body-return detection

Use `ReturnStatementVisitor` from `ruff_python_ast::helpers` to collect all `return` statements (handles nested ifs). A function matches the body path if **any** return value is:

- `Expr::Dict(_)` — inline `{...}` literal.
- `Expr::DictComp(_)` — dict comprehension.
- `Expr::Call(c)` where `c.func` resolves to the builtin `dict` (`semantic().match_builtin_expr(&c.func, "dict")`).
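
The body-path check above can be sketched in Python with the standard `ast` module. This is only an illustration of the idea, not the Rust visitor: `returns_dict_like` is a hypothetical name, and the `dict(...)` test is a plain name comparison that, unlike a builtin-aware check, would be fooled by a shadowed `dict`.

```python
import ast


def returns_dict_like(func: ast.FunctionDef) -> bool:
    """True if any `return` belonging to `func` itself yields a dict literal,
    a dict comprehension, or a `dict(...)` call."""
    stack = list(func.body)
    while stack:
        node = stack.pop()
        # Do not descend into nested scopes: their returns belong to them.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Lambda)):
            continue
        if isinstance(node, ast.Return) and node.value is not None:
            value = node.value
            if isinstance(value, (ast.Dict, ast.DictComp)):
                return True
            # Simplification: assumes the name `dict` refers to the builtin.
            if isinstance(value, ast.Call) and isinstance(value.func, ast.Name) and value.func.id == "dict":
                return True
        stack.extend(ast.iter_child_nodes(node))
    return False


SOURCE = '''
def literal():
    return {"x": 1}

def comp(keys):
    if keys:
        return {k: 1 for k in keys}
    return None

def call():
    return dict(a=1)

def nested_only():
    def inner():
        return {"x": 1}
    return inner

def scalar():
    return 1
'''

results = {
    node.name: returns_dict_like(node)
    for node in ast.parse(SOURCE).body
    if isinstance(node, ast.FunctionDef)
}
print(results)
```

The `nested_only` case is the one worth keeping as a negative fixture: the dict literal belongs to `inner`, not to the outer function.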

## Files to create / modify

### New
- `crates/ruff_linter/src/rules/airflow/rules/task_multiple_outputs_implicit.rs` — rule implementation.
- `crates/ruff_linter/resources/test/fixtures/airflow/AIR202.py` — test fixture with positive/negative cases (see "Fixture cases" below).

### Modified
- `crates/ruff_linter/src/codes.rs` — register `(Airflow, "202") => rules::airflow::rules::AirflowTaskMultipleOutputsImplicit` under the airflow section.
- `crates/ruff_linter/src/rules/airflow/rules/mod.rs` — add `pub(crate) use task_multiple_outputs_implicit::*;` and `mod task_multiple_outputs_implicit;` (alphabetical).
- `crates/ruff_linter/src/rules/airflow/mod.rs` — add `#[test_case(Rule::AirflowTaskMultipleOutputsImplicit, Path::new("AIR202.py"))]`.
- `crates/ruff_linter/src/checkers/ast/analyze/statement.rs` — dispatch on `Stmt::FunctionDef(function_def)` near the existing `AirflowTaskBranchAsShortCircuit` dispatch:
  ```rust
  if checker.is_rule_enabled(Rule::AirflowTaskMultipleOutputsImplicit) {
      airflow::rules::task_multiple_outputs_implicit(checker, function_def);
  }
  ```
- `crates/ruff_linter/src/rules/airflow/helpers.rs` — only if a new shared helper is needed for decorator matching (see "Decorator matching"); otherwise leave untouched.

### Reused utilities (no duplication)
- `crate::rules::airflow::helpers::is_airflow_task_variant` — variant detection.
- `crate::fix::edits::add_argument` — kwarg insertion for `@task(...)` calls (`crates/ruff_linter/src/fix/edits.rs:274`).
- `ruff_python_ast::helpers::{ReturnStatementVisitor, map_callable}` — body traversal and decorator unwrapping.
- `ruff_python_semantic::SemanticModel::{resolve_qualified_name, match_builtin_expr, match_typing_expr}` — type resolution.

## Diagnostic

- **Message (generic):** `` `@task`-decorated function relies on `multiple_outputs` inference ``.
- **Fix title (variant-specific):**
  - `` Add `multiple_outputs=True` `` (annotation path)
  - `` Add `multiple_outputs=False` `` (body-only path)

Use a `Kind` enum on the struct so the fix title can branch on the path that fired.

## Fixture cases (AIR202.py)

Positive (should flag):
- `@task` + `-> dict:` returning dict literal (annotation path; fix: True).
- `@task()` + `-> Mapping[str, int]:` (annotation path; fix: True).
- `@task` + no annotation + `return {"x": 1}` (body path; fix: False).
- `@task` + no annotation + dict comprehension return (body path; fix: False).
- `@task` + no annotation + `return dict(a=1)` (body path; fix: False).
- `@task.branch`, `@task.short_circuit`, `@task.virtualenv`, `@task.docker`, `@task.kubernetes` each with a dict literal return.
- Both `from airflow.decorators import task` and `from airflow.sdk import task` import forms.
- `class MyTD(TypedDict): ...` annotation (annotation path; fix: True).

Negative (should NOT flag):
- `@task(multiple_outputs=True)` and `@task(multiple_outputs=False)` — already explicit.
- `@task` returning a non-dict (e.g., `return 1`, `return [1,2]`, `return (1,2)`) with no Mapping annotation.
- `@task` with `-> list[int]:` annotation.
- `@task.sensor` returning a dict (sensor forbids `multiple_outputs`).
- Plain function (no `@task` decorator) returning a dict.
- `@task` returning `None` / no return statements.

## Verification

1. `cargo fmt -p ruff_linter`
2. Manual check:
   ```sh
   cargo run -p ruff -- check crates/ruff_linter/resources/test/fixtures/airflow/AIR202.py --no-cache --preview --select AIR202
   cargo run -p ruff -- check crates/ruff_linter/resources/test/fixtures/airflow/AIR202.py --no-cache --preview --select AIR202 --fix --unsafe-fixes --diff
   ```
3. Snapshot test:
   ```sh
   RUFF_UPDATE_SCHEMA=1 cargo nextest run -p ruff_linter -- "airflow::tests"
   cargo insta accept
   RUFF_UPDATE_SCHEMA=1 cargo nextest run -p ruff_linter -- "airflow::tests"
   ```
4. `cargo dev generate-all` to regenerate docs/schema.
5. `cargo clippy --workspace --all-targets --all-features -- -D warnings`.
6. `uvx prek run -a`.
7. Validate against the local Airflow clone for false positives:
   ```sh
   cargo run -p ruff -- check /home/iliya/repositories/airflow --no-cache --preview --select AIR202
   ```
   Review hits; if a false-positive pattern surfaces, add it as a negative case in the fixture and tighten the rule.

Implementations

Below are the "raw" outputs from the point at which the agent considers itself "done".

No-SKILL implementation: Dev-iL@b61319f
SKILL-assisted implementation: Dev-iL@c276a88

Comparison

Manual

My personal impression is that with the SKILL, the exploration of the Airflow codebase was more thorough, and the questions shown during the planning stage were more focused.

Automated (Codex)

Important

Mango is the no-SKILL plan and Goblet is the SKILL-assisted one.

This is the prompt given to two AI agents tasked with creating an implementation plan.
<prompt>
Attached are two plans for implementing the feature. Compare and contrast. Score the plans on correctness, thoroughness, alignment with the prompt, and other parameters you think matter. During the comparison refer to the plans as "Goblet" and "Mango".

Comparison of Plans

Direct answer

Goblet is the stronger plan overall. It is more faithful to the prompt, more complete on Ruff integration, better on test coverage, and more careful about preserving Airflow’s current runtime behavior in the autofix. I would use Goblet as the base implementation plan, with a few corrections.

Overall scores

| Dimension | Goblet | Mango |
| --- | --- | --- |
| Correctness vs prompt | 8.5 / 10 | 6.5 / 10 |
| Thoroughness | 9 / 10 | 7 / 10 |
| Ruff implementation fit | 8.5 / 10 | 6.5 / 10 |
| Autofix design | 8 / 10 | 6.5 / 10 |
| Test plan | 9 / 10 | 7 / 10 |
| False-positive / false-negative awareness | 8 / 10 | 6 / 10 |
| Naming / maintainability | 8 / 10 | 6.5 / 10 |
| Verification plan | 9 / 10 | 7.5 / 10 |
| Overall | 8.6 / 10 | 6.7 / 10 |

Key differences

1. Trigger coverage

Goblet matches the prompt more completely. It says the rule should fire for @task-decorated functions without multiple_outputs when either the return annotation is Mapping-like or the body returns an inline dict, dict comprehension, or dict(...) call. That directly covers the prompt’s “dict definition patterns common in airflow” and gives a concrete additional dict(...) pattern.

Mango omits return dict(...). It only matches Expr::Dict and Expr::DictComp, and explicitly lists return dict(...) as a non-goal / false negative. That is a weaker match to the prompt, because the prompt asks for “dict definition patterns common in airflow” and gives dict comprehension and inline dict as examples, not an exhaustive list.

Winner: Goblet.

2. Autofix behavior

Both plans understand the core autofix rule correctly: add multiple_outputs=True when Airflow would infer true from the annotation; otherwise add multiple_outputs=False for body-only dict returns. Goblet states this clearly and ties it to preserving runtime behavior.

Goblet also recommends using Ruff’s existing add_argument helper for called decorators, which is safer and more idiomatic than hand-positioning commas and parentheses. Mango proposes manual insertion based on positions like “just before the call’s ) token,” which is more brittle, especially around comments, multiline decorators, formatting, and trailing commas.

Winner: Goblet.

3. Decorator matching

Goblet is more careful here. It explicitly includes core @task, Airflow 2 and 3 import paths, and common task variants such as task.virtualenv, task.branch, task.short_circuit, task.docker, task.kubernetes, and task.pyspark; it also explicitly excludes task.sensor.

Mango says “all @task.* forms” and suggests reusing is_airflow_task(), but that may be too broad because task.sensor is special and should not be treated like ordinary multiple-output-capable task decorators. Goblet’s explicit exclusion is important.

Winner: Goblet.

4. Mapping annotation detection

Goblet is much broader: it covers built-in dict, typing.Dict, typing.Mapping, MutableMapping, OrderedDict, DefaultDict, Counter, ChainMap, collections.abc.Mapping, and related forms. Mango covers the basic cases but misses several Mapping-like types that Airflow’s runtime issubclass(..., Mapping) logic would likely treat as Mapping subclasses.

However, Goblet may overreach on TypedDict. Airflow’s runtime inference is based on issubclass(ttype, Mapping), and TypedDict is not a normal runtime Mapping subclass in the same way as dict or collections.OrderedDict. Goblet’s recommendation to flag TypedDict subclasses should be verified against the exact Airflow inference behavior before implementation. Mango avoids that possible overreach, but it does not discuss the issue.

Winner: Goblet, with a caution on TypedDict.
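
One way to probe the TypedDict question is to run the two inference lines quoted in the SKILL-assisted plan against a handful of annotations. The sketch below does only that: it assumes those two lines are the whole check, whereas the real Airflow function may do more (for example, resolving the annotation first), so treat the printed results as a starting point rather than a verdict. `infers_multiple_outputs` and `Payload` are hypothetical names.

```python
import typing
from collections.abc import Mapping
from typing import Any, Dict, TypedDict


def infers_multiple_outputs(return_type: Any) -> bool:
    # The two lines quoted in the plan; assumed to be the whole check.
    ttype = getattr(return_type, "__origin__", return_type)
    return isinstance(ttype, type) and issubclass(ttype, Mapping)


class Payload(TypedDict):  # hypothetical example type
    x: int


candidates = {
    "dict": dict,
    "dict[str, int]": dict[str, int],
    "typing.Dict[str, int]": Dict[str, int],
    "typing.Mapping[str, int]": typing.Mapping[str, int],
    "list[int]": list[int],
    "Payload (TypedDict)": Payload,
}

for label, annotation in candidates.items():
    try:
        outcome = str(infers_multiple_outputs(annotation))
    except TypeError as exc:  # defensive: some typing constructs reject class checks
        outcome = f"TypeError: {exc}"
    print(f"{label}: {outcome}")
```

Whatever the `Payload` line prints on the Python version in use is what a static rule would need to mirror for TypedDict annotations, assuming the two-line check is representative.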

5. Ruff integration details

Goblet gives a cleaner and more internally consistent integration plan: rule file, fixture, codes.rs, airflow rule module, airflow tests, AST statement dispatch, helper reuse, and verification commands.

Mango’s integration is serviceable, but the naming is less aligned with the rule’s purpose. AirflowTaskMissingMultipleOutputs sounds like it flags every task missing the parameter, even though the rule only applies to Mapping-returning / dict-returning tasks. Goblet’s AirflowTaskMultipleOutputsImplicit better captures the actual violation: relying on inference.

Winner: Goblet.

6. Test coverage

Goblet’s fixture plan is stronger. It includes annotation positives, body positives, dict(...), multiple task variants, both Airflow import paths, explicit multiple_outputs negatives, non-dict returns, task.sensor, plain functions, and no-return cases.

Mango includes useful cases too, especially decorator-call forms like @task(), @task(retries=3), and nested-function false positives. But it misses broader variant/import coverage and omits dict(...).

Winner: Goblet, though Mango has a good nested-function negative case that should be added to Goblet.

Main issues in each plan

Goblet issues to fix

- Possible TypedDict overreach. Verify whether Airflow’s _infer_multiple_outputs actually returns true for a TypedDict subclass. If not, do not flag it via the annotation path.
- Potentially too broad Mapping-family list. For each annotation type, compare against Airflow’s actual runtime getattr(return_type, "__origin__", return_type) plus issubclass(..., Mapping) behavior. The rule should mirror that as closely as Ruff can statically.
- Name note typo. The description says AirflowTaskMultipleOutputsImplicit “reads ‘allow airflow task multiple outputs implicit’,” which looks like a typo. It should read more like “Airflow task multiple outputs implicit.”
- Add Mango’s nested-function negative test. This is important to ensure dict returns inside nested functions/classes do not trigger the outer task.

Mango issues to fix

- Missing dict(...) body detection. This is the biggest prompt-alignment gap.
- Decorator matching is too loose. It says all @task.* forms but does not explicitly exclude task.sensor.
- Manual fix insertion is brittle. Prefer Ruff’s existing edit helpers where available.
- Rule name is misleading. AirflowTaskMissingMultipleOutputs suggests the rule applies to all tasks without multiple_outputs.
- Fewer fixture cases. It lacks broader variant and import coverage.
- Open/non-goals accept too many false negatives. Type aliases and arbitrary returned variables are reasonable non-goals, but excluding dict(...) is less defensible for this prompt.

Recommendation

Use Goblet as the base plan, then incorporate these Mango improvements:

- Add a negative test for dict literals returned inside nested functions/classes.
- Add explicit decorator-call-form tests: @task(), @task(retries=3), multiline decorators, and variant calls.
- Before implementing TypedDict support, validate Airflow’s actual inference behavior; drop TypedDict if it does not mirror _infer_multiple_outputs.
- Use Ruff’s add_argument or equivalent existing helper for called decorators rather than manual token-position editing.
- Keep Goblet’s rule name or something similarly precise, e.g. AirflowTaskImplicitMultipleOutputs.

Bottom line: Goblet is more implementation-ready and more aligned with the requested AIR202 semantics. Mango is a decent simpler plan, but it leaves avoidable false negatives and has a more fragile autofix strategy.

0 replies
