aaronsteers
Collaborator

feat: Add Kedro+Ibis POC with source-faker integration

Summary

This PR adds a standalone proof-of-concept demonstrating Kedro+Ibis integration patterns for scalable data pipelines. The POC lives in the kedro-ibis-poc/ directory and implements a three-stage pipeline:

  1. Extract: Uses PyAirbyte's source-faker connector to generate test data (users, products, purchases)
  2. Staging: Transforms raw tables into cleaned staging views with column selection/renaming
  3. Fact: Creates aggregated fact tables (user purchase summaries, product sales summaries)
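As a rough illustration of the extract stage, the sketch below selects the three faker streams explicitly and only then reads them. The helper name `read_selected_streams` and the `FAKER_STREAMS` constant are hypothetical and not taken from the POC; the function only assumes a source object exposing `select_streams()` and `read()`, as PyAirbyte sources do.

```python
from typing import Any, Iterable, Mapping

# Stream names listed in the PR summary (users, products, purchases).
FAKER_STREAMS = ("users", "products", "purchases")


def read_selected_streams(
    source: Any, streams: Iterable[str] = FAKER_STREAMS
) -> Mapping[str, Any]:
    """Select streams explicitly, then read them.

    `source` is assumed to behave like a PyAirbyte source: it must offer
    `select_streams(list_of_names)` and `read()`.
    """
    source.select_streams(list(streams))  # selection must come first
    return source.read()                  # then read the selected streams


# Hypothetical usage with PyAirbyte (requires the `airbyte` package):
#   import airbyte as ab
#   source = ab.get_source(
#       "source-faker", config={"count": 1000}, install_if_missing=True
#   )
#   result = read_selected_streams(source)
#   users_df = result["users"].to_pandas()
```

Passing the source in as an argument keeps the helper easy to exercise with a stub object, without installing the connector.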

Key features:

  • Deferred Execution: Ibis pushes computation to DuckDB backend
  • Universal Interface: Same Ibis code works across 15+ backends
  • Official Integration: Uses kedro-datasets[ibis-duckdb] maintained by Kedro team
  • Declarative Configuration: OmegaConf variable interpolation for backend reuse

The implementation follows patterns from the Kedro+Ibis blog article and the deepyaman/jaffle-shop reference implementation.

Review & Testing Checklist for Human

Critical (3 items) - This POC has not been tested end-to-end:

  • Run the pipeline end-to-end: cd kedro-ibis-poc && uv sync && kedro run - verify it completes without errors
  • Validate data outputs: Query DuckDB tables to ensure data extraction and transformations work correctly:
    duckdb data/kedro_ibis_poc.duckdb -c "SELECT COUNT(*) FROM raw_users; SELECT * FROM stg_users LIMIT 5; SELECT * FROM fct_user_purchases LIMIT 10;"
  • Verify dependency resolution: Ensure uv sync installs all required packages without conflicts (kedro~=0.19.0, airbyte~=0.24.2, ibis-framework[duckdb]~=9.0)

Notes

  • POC is standalone and not integrated with existing morph functionality, per @aaronsteers's request
  • Documentation kept minimal with external links per feedback
  • Development details moved to CONTRIBUTING.md to separate concerns
  • All lint checks passed, but type-check shows pre-existing issues in the morph codebase (not related to the POC)

Link to Devin run: https://app.devin.ai/sessions/9c3ec40c87694e3ba931dcc52837b7c3
Requested by: @aaronsteers

- Create standalone POC in kedro-ibis-poc/ directory
- Implement three-stage pipeline: extract → staging → fact
- Use PyAirbyte source-faker for data generation
- Use Ibis for universal data transformations
- Use DuckDB as analytical backend
- Follow patterns from Kedro+Ibis blog article and jaffle-shop reference
- Minimal README with external doc links per user feedback
- Separate CONTRIBUTING.md for development details

Co-Authored-By: AJ Steers <[email protected]>

Original prompt from AJ Steers
Received message in Slack channel #ask-devin-ai:

@Devin - Create a POC mockup for a POC on this foundation. Put your work into a new project folder in the morph (airbyte-integration-catalog) repo, but don't bias your work towards the existing content there. Let's treat this as net-new. <https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis>
Thread URL: https://airbytehq-team.slack.com/archives/C08BHPUMEPJ/p1759257854719199?thread_ts=1759257854.719199


🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

coderabbitai bot commented Sep 30, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


- Upgrade kedro to >=1.0.0 (was ~=0.19.0)
- Upgrade kedro-datasets to ~=8.1 (was ~=4.0)
- Upgrade airbyte to >=0.31.0 (was ~=0.24.2)
- Upgrade duckdb to ~=1.1 (was ~=1.0)
- Upgrade ibis-framework to ~=9.5 (was ~=9.0)

This resolves the 'TypeError: unhashable type: _duckdb.typing.DuckDBPyType' error
that was occurring with duckdb 1.x and duckdb-engine.
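One way to confirm which versions actually got installed after the upgrade is to ask the environment directly. This is a minimal standard-library sketch; the helper name `installed_versions` is hypothetical, and the distribution names are assumed to match the packages listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names assumed from the dependency list in this commit.
POC_DISTRIBUTIONS = (
    "kedro",
    "kedro-datasets",
    "airbyte",
    "duckdb",
    "ibis-framework",
)


def installed_versions(names=POC_DISTRIBUTIONS):
    """Map each distribution name to its installed version, or None."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

Run inside the project environment (for example via `uv run python`), this prints one line per package and makes a stale pin easy to spot.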

Pipeline implementation fixes:
- Add stream selection before source.read() in extract pipeline
- Fix staging transformations to match actual source-faker schema
  - products table has make/model, not name column
  - Use concat() for string concatenation instead of + operator
- Fix fact aggregations to use proper scalar operations
  - Use price.sum() instead of price * count()
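To illustrate why the fact aggregation sums prices rather than multiplying a price by a row count, here is a small sketch. It uses Python's built-in sqlite3 as a stand-in for the Ibis/DuckDB stack, and the table layout and sample rows are made up for the example (only the make/model columns are taken from the notes above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (id INTEGER PRIMARY KEY, make TEXT, model TEXT, price REAL);
    CREATE TABLE purchases (id INTEGER PRIMARY KEY, user_id INTEGER, product_id INTEGER);
    INSERT INTO products VALUES (1, 'Mazda', 'MX-5', 100.0), (2, 'Ford', 'Focus', 250.0);
    INSERT INTO purchases VALUES (1, 7, 1), (2, 7, 2), (3, 8, 2);
    """
)

# SUM(price) adds up the price of every purchased product per user.
# "price * COUNT(*)" would only be right if every purchase in a group
# had the same price: user 7 bought items at 100 and 250, so any single
# price times 2 gives 200 or 500, never the true total of 350.
rows = conn.execute(
    """
    SELECT pu.user_id,
           COUNT(*)      AS purchase_count,
           SUM(pr.price) AS total_spent
    FROM purchases AS pu
    JOIN products  AS pr ON pr.id = pu.product_id
    GROUP BY pu.user_id
    ORDER BY pu.user_id
    """
).fetchall()

print(rows)  # [(7, 2, 350.0), (8, 1, 250.0)]
```

The same reasoning applies to the Ibis expression: `price.sum()` is a proper reduction over the group, whereas `price * count()` mixes a row-level column with an aggregate.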

Pipeline now runs successfully end-to-end:
- Extract: 1000 users, 100 products, 1000 purchases from source-faker
- Staging: Clean and transform raw tables into staging views
- Fact: Create aggregated fact tables (user purchases, product sales)

Tested with 'kedro run' - all 6 nodes complete successfully in ~9 seconds.

Co-Authored-By: AJ Steers <[email protected]>