Add Airflow operator for dataset de-identification #319

@maziyarpanahi

Description

Summary

Healthcare ETL is predominantly orchestrated through Airflow, and the v2.5/3.0 KPI targets broad ecosystem integration (§8.5). No existing task provides an Airflow operator. A thin OpenMedDeidentifyOperator that wraps the existing dataset-redaction runner (OM-055) would let data engineers drop OpenMed redaction into a DAG as a first-class task, while keeping provider packaging optional.

Scope

  • Implement openmed/interop/airflow_operator.py exposing OpenMedDeidentifyOperator(BaseOperator) whose execute() calls the OM-055 redact_dataset runner (input path, text_columns, policy, output path).
  • Keep the import surface thin: subclass apache-airflow's BaseOperator behind a guarded import; importing the core openmed package must never import airflow.
  • Add an [airflow] extra to pyproject.toml.
  • The operator returns the per-file audit summary (no raw PHI) as the task return value, which Airflow pushes to XCom by default.
  • Document a minimal DAG snippet using the operator.
  • Unit test that instantiates the operator and calls execute() against a fixture file with airflow's BaseOperator stubbed/mocked, plus a core-no-import guard test.
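
The operator described in the scope above could look roughly like the following. This is a minimal sketch, not the planned implementation: the OM-055 runner's module path and signature are not given in this issue, so the runner is passed in as a hypothetical `runner` callable, and the keyword names simply mirror the parameters listed above. The fallback `BaseOperator` is a stand-in that lets the sketch run without airflow; a real module might instead raise an ImportError that points at the [airflow] extra.

```python
from typing import Any, Callable, Mapping, Sequence

try:
    # Real base class when the optional [airflow] extra is installed.
    from airflow.models import BaseOperator
except ImportError:

    class BaseOperator:  # type: ignore[no-redef]
        """Minimal stand-in so this sketch runs without airflow installed."""

        def __init__(self, *, task_id: str, **kwargs: Any) -> None:
            self.task_id = task_id


class OpenMedDeidentifyOperator(BaseOperator):
    """Thin wrapper: store parameters, delegate to the runner, return its summary."""

    def __init__(
        self,
        *,
        input_path: str,
        output_path: str,
        text_columns: Sequence[str],
        policy: Any,
        runner: Callable[..., Mapping[str, Any]],  # hypothetical: stands in for OM-055
        **kwargs: Any,
    ) -> None:
        super().__init__(**kwargs)
        self.input_path = input_path
        self.output_path = output_path
        self.text_columns = list(text_columns)
        self.policy = policy
        self.runner = runner

    def execute(self, context: Any) -> dict:
        # Assumed keyword names; the actual redact_dataset signature may differ.
        summary = self.runner(
            input_path=self.input_path,
            output_path=self.output_path,
            text_columns=self.text_columns,
            policy=self.policy,
        )
        # The return value is pushed to XCom by default, so it should
        # carry audit counts only and never raw PHI.
        return dict(summary)
```

In a DAG, the operator would be instantiated like any other task, with `task_id` plus the paths, columns, and policy. Injecting the runner also makes the unit test straightforward, since a fake callable can record the arguments it receives.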

Acceptance criteria

  • OpenMedDeidentifyOperator.execute() redacts a dataset file via the OM-055 runner and returns the audit summary.
  • pyproject.toml has an [airflow] extra; a test asserts core import does not import airflow.
  • A test runs execute() on a fixture file with airflow mocked and verifies output redaction.
  • A DAG usage snippet is documented.
  • Test suite is green: .venv/bin/python -m pytest tests/ -q
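
One way to write the core-no-import guard test is to import the package in a fresh interpreter and inspect `sys.modules` there, so modules already loaded by the test session cannot mask a leak. The helper below is a sketch under that assumption; `leaks_import` is a hypothetical name, not an existing OpenMed utility.

```python
import subprocess
import sys


def leaks_import(package: str, forbidden: str) -> bool:
    """Return True if importing `package` in a clean interpreter loads `forbidden`."""
    probe = (
        "import importlib, sys\n"
        f"importlib.import_module({package!r})\n"
        f"name = {forbidden!r}\n"
        "hit = any(m == name or m.startswith(name + '.') for m in sys.modules)\n"
        "print(hit)\n"
    )
    # check=True surfaces a failing import as CalledProcessError
    # instead of silently reporting "no leak".
    result = subprocess.run(
        [sys.executable, "-c", probe],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"
```

The guard test itself would then reduce to something like `assert not leaks_import("openmed", "airflow")`.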

Out of scope

  • Publishing an apache-airflow-providers-openmed package.
  • Structured k-anonymity transforms (OM-044).
  • Streaming/sensor operators.

Files

  • openmed/interop/airflow_operator.py
  • pyproject.toml
  • tests/unit/interop/test_airflow_operator.py

Task: OM-154 · Milestone: Backlog · Priority: P3 · Size: S
Depends on: — · Blocks: —
Roadmap: §8.5 Ecosystem KPI (data-ecosystem integrations); OM-055 runner
Spec: PLANS/V2/EXECUTION/tasks/OM-154.md

Metadata

Labels

  • P3: Strategic
  • feature: New capability
  • good first issue: Good for newcomers
  • help wanted: Extra attention is needed
  • roadmap-v2: OpenMed V2 roadmap backlog

