Thoughts on monorepos (dependency isolation) #4147

astrojuanlu · 2024-09-06T07:45:38Z

astrojuanlu
Sep 6, 2024
Maintainer

With the deprecation of micropackaging #3854, it would be good to explore monorepo approaches. What "monorepo" actually means depends on the use case.

astrojuanlu · 2024-09-06T07:45:46Z

astrojuanlu
Sep 6, 2024
Maintainer Author

One request that comes up from time to time is "1 Kedro project, different groups of dependencies". At the moment, 1 Kedro project = 1 Python package, therefore having completely disjoint dependencies is not possible. There are several possible solutions:

1. 1 Python package
   - One could have a very minimal set of common dependencies (say, just kedro>=0.19.6) and then groups of dependencies per use case using PEP 621 [project.optional-dependencies].
   - There's also the draft PEP 735 about dependency groups (to be submitted in 2 weeks?)
2. 1 "root" Python package, other connected packages
   - There's uv workspaces (akin to Cargo monorepos) https://docs.astral.sh/uv/concepts/workspaces/

1 reply

astrojuanlu Oct 30, 2024
Maintainer Author

Requested clarification on "non-workspaces" astral-sh/uv#5605 (comment)

astrojuanlu · 2024-09-06T07:47:05Z

astrojuanlu
Sep 6, 2024
Maintainer Author

Beyond the "different groups of dependencies" use case, today a user asked about having different packages under src:

my_project
└── src
    ├── use_case_1
    │   ├── __init__.py
    │   ├── __main__.py
    │   └── pipelines
    │       ├── use_case_1_data_engineering
    │       └── use_case_1_modeling
    └── use_case_2
        ├── __init__.py
        ├── __main__.py
        └── pipelines
            ├── use_case_2_data_ingestion
            └── use_case_2_data_science

I tried this with both setuptools and PDM build backends and it worked like a charm (hatchling, on the other hand, choked). Sample pyproject.toml:

[project]
name = "multi"
version = "0.1.0"
readme = "README.md"
requires-python = ">=3.11"
dependencies = []

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

However, I'm sure Kedro has some hardcoded logic that looks into a specific path inside src/. That logic also enables kedro pipeline create and other commands that manipulate the source code. So supporting this fully in Kedro might require some thought.

0 replies

astrojuanlu · 2024-10-17T22:04:19Z

astrojuanlu
Oct 17, 2024
Maintainer Author

To note, PEP 735 (Dependency Groups) is now accepted. https://peps.python.org/pep-0735/

1 reply

astrojuanlu Oct 26, 2024
Maintainer Author

uv and tox already shipped support for it

...but pip hasn't, and might never do it pypa/pip#12963 (comment)

astrojuanlu · 2024-10-17T22:18:07Z

astrojuanlu
Oct 17, 2024
Maintainer Author

Useful ecosystem overview of how other languages do this https://gist.github.com/konstin/6d04f111563641beb10facb617fe0eb3

0 replies

astrojuanlu · 2024-11-04T18:44:33Z

astrojuanlu
Nov 4, 2024
Maintainer Author

Yet another user concerned about the 1 Kedro pipeline = 1 set of requirements https://kedro.hall.community/evaluating-kedro-for-data-engineering-processes-qFYCwWk5VKQh

0 replies

astrojuanlu · 2024-12-10T11:47:44Z

astrojuanlu
Dec 10, 2024
Maintainer Author

@marrrcin @Galileo-Galilei @sbrugman would be interested if you also agree that dependency isolation is something we need to also consider within this problem space?

I think it's a valid consideration. At the very least, there needs to be an answer as to how you handle pipelines that have multiple sets of dependencies (e.g. should data_engineering and data_science be separate pipelines, or should they be separate Kedro projects, if they have a separate set of, potentially-incompatible, dependencies?). In my past work on large Kedro projects (ages ago), I heavily leveraged monorepo approaches, as in #4147, because there is also a tradeoff to maintaining multiple Kedro projects for a single use case.

I can't find my previous comments to this effect (perhaps they were verbal), but my opinions are:

Micro-packaging was always a bad idea (for Kedro framework to support); they were introduced to solve a QB-internal use case that never even leveraged them/built their own solution.

However, being able to define separate requirements for top-level pipelines IS useful, especially for deployment. The story for running a pipeline locally with multiple sets of dependencies was never really there, but it seems useful.

Supporting complex node grouping is unnecessary, at least to start with; that's akin to the micro-packaging over-engineering problem. The vast majority of users would benefit from being able to deploy each modular pipeline (with its own namespace) separately. This also creates pretty clear boundaries where node persistence, etc. may be required.

TL;DR I agree with the idea of namespaces and supporting deployment based on namespaces, but I further posit that namespaces just being per modular pipeline is sufficient, even for deployment, and that will also make them easier to adopt.

Originally posted by @deepyaman in #4319 (comment)

2 replies

astrojuanlu Dec 10, 2024
Maintainer Author

I mostly agree with all of deepyaman's comments. I actually crave pipeline requirements isolation, but I've never thought about requirements isolation at namespace level. This would make sense from an orchestrator point of view with "big nodes" filtered out from namespaced pipelines, but I am not sure this is what we should focus on in the short term because it's likely a hard problem.

Originally posted by @Galileo-Galilei in #4319 (comment)

astrojuanlu Dec 10, 2024
Maintainer Author

I agree with the need for dependency isolation for large projects 👍🏻

Originally posted by @marrrcin in #4319 (comment)

astrojuanlu · 2025-04-02T11:44:49Z

astrojuanlu
Apr 2, 2025
Maintainer Author

@datajoely says:

Also on the monorepo point I saw this recently

https://chrismati.cz/posts/uv-pex-monorepo/

(#4618 (comment))

1 reply

astrojuanlu Apr 2, 2025
Maintainer Author

And pex wouldn't even be necessary for us. Yes, I think it's clear uv got a lot of things right, also for monorepos :)

astrojuanlu · 2025-04-21T14:12:40Z

astrojuanlu
Apr 21, 2025
Maintainer Author

crazy idea for the "conflicting dependencies" problem: a modified find_pipelines so that you can do

# pipeline_registry.py

# from kedro.framework.project import find_pipelines
from kedro_monorepo.util import find_pipelines  # <-------------

...
def register_pipelines() -> dict[str, Pipeline]:
    # Unchanged
    pipelines = find_pipelines()
    pipelines["__default__"] = sum(pipelines.values())
    return pipelines

The difference is that this find_pipelines scans for entry points (via importlib.metadata), so you can define your pipelines in external packages, each of which might have its own set of dependencies.

The only missing bit would be having some tooling to make this easier, like kedro-monorepo pipeline create ....
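
A rough sketch of what such an entry-point-based find_pipelines could look like, using only importlib.metadata from the standard library (Python 3.10+). Everything here is an assumption for illustration: the kedro_monorepo package does not exist, the entry point group name is invented, and each entry point is assumed to reference a zero-argument factory such as a create_pipeline function.

```python
from importlib.metadata import EntryPoint, entry_points
from typing import Any, Iterable, Optional

# Hypothetical entry point group; Kedro does not define it today.
# An external package would advertise a pipeline in its own pyproject.toml:
#   [project.entry-points."kedro_monorepo.pipelines"]
#   use_case_1 = "use_case_1.pipeline:create_pipeline"
PIPELINES_GROUP = "kedro_monorepo.pipelines"


def find_pipelines(
    candidates: Optional[Iterable[EntryPoint]] = None,
) -> dict[str, Any]:
    """Collect pipelines advertised by installed packages via entry points.

    Each entry point is expected to reference a zero-argument factory
    that returns a pipeline object.
    """
    if candidates is None:
        # Selecting by keyword requires Python 3.10+.
        candidates = entry_points(group=PIPELINES_GROUP)

    pipelines: dict[str, Any] = {}
    for ep in candidates:
        try:
            factory = ep.load()  # imports the external package
        except ImportError as exc:
            # The module (or one of its own imports) is not available here.
            print(f"Skipping pipeline {ep.name!r}: {exc}")
            continue
        pipelines[ep.name] = factory()
    return pipelines
```

Under this design, an environment only sees the pipelines whose packages are installed in it, so two pipeline packages with incompatible requirements would live in separate environments rather than side by side.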

5 replies

datajoely Apr 22, 2025
Collaborator

this is clever, I was more thinking about a uv dependency group for each isolated pipeline

astrojuanlu Apr 23, 2025
Maintainer Author

That's simpler at the cost of keeping everything in 1 package. Worth evaluating pros and cons, or just ask users what they'd prefer
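
For comparison, a small sketch of how the dependency-group route might be driven from the command line. It only builds the commands and prints them. The assumption (not something stated in this thread) is that each pipeline has a dependency group of the same name in pyproject.toml; `uv run --group` and `kedro run --pipeline` are existing options, while the group and pipeline names are hypothetical.

```python
import shlex
from typing import Optional


def uv_run_command(pipeline: str, group: Optional[str] = None) -> list[str]:
    """Build a `uv run` invocation that runs one Kedro pipeline.

    Assumes a dependency group named after the pipeline unless `group` is given.
    """
    group = group or pipeline
    return ["uv", "run", "--group", group, "kedro", "run", "--pipeline", pipeline]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for name in ("data_engineering", "data_science"):
        print(shlex.join(uv_run_command(name)))
```

One caveat to verify against uv's documentation: as far as I understand, uv resolves all groups into a single lockfile by default, so groups that genuinely conflict would need extra configuration.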

astrojuanlu Apr 23, 2025
Maintainer Author

Still, I think it's clear that, one way or another, this is easily solvable with current tech

astrojuanlu May 23, 2025
Maintainer Author

Been thinking more about the dependency group idea. I think it's an intermediate step towards something else - in other words, it doesn't solve the problem that micropackaging used to solve:

Micro-packaging allows users to share Kedro micro-packages across codebases, organisations and beyond. A micro-package can be any part of Python code in a Kedro project including pipelines and utility functions.

datajoely May 23, 2025
Collaborator

This is fair, but historically micropackaging was primarily introduced for an internal pain point, which internal teams still bypassed with their own bespoke solution.

astrojuanlu · 2025-05-06T11:02:55Z

astrojuanlu
May 6, 2025
Maintainer Author

uv solves the dependency isolation issue for Bruin by just running everything in isolated, ephemeral environments https://www.linkedin.com/posts/burakkarakan_theres-no-other-tool-in-the-market-that-activity-7325440443893092353-rbFu?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAXcavUB6j26_YjUtYinmzmdjTchguHHuG4

0 replies

gitgud5000 · 2025-05-19T00:26:21Z

gitgud5000
May 19, 2025

Hi @astrojuanlu, thanks for the insights above.

I have a Kedro project where different pipelines (or parts of them) run on heterogeneous infrastructure — some on AzureML with GPU/CUDA support, others on CPU-only CI/VMs.

Naturally, some parts of the project depend on GPU libraries like torch, which aren’t available or necessary in all environments. I want to avoid import-time errors/warnings when these libraries aren’t installed or CUDA devices aren’t available.

To work around this, I’m using a pattern like:

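def gpu_check():
    try:
        import torch  # optional dependency, may be missing on CPU-only machines
    except ImportError:
        return False
    return torch.cuda.is_available()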

Would you consider this an acceptable practice within Kedro's design?

Also, would you recommend managing these environment-specific dependencies via [project.optional-dependencies] in pyproject.toml (e.g., gpu, cpu, etc.), or is it better to go further and split into separate Kedro projects/packages if things get more complex?

I want to keep a clean, maintainable structure that avoids these import-time issues across environments.

Thanks!

1 reply

astrojuanlu May 19, 2025
Maintainer Author

[project.optional-dependencies] in pyproject.toml (e.g., gpu, cpu, etc.), or is it better to go further and split into separate Kedro projects/packages if things get more complex?

For the time being, I think the 3 approaches are basically fine:

1. Conditional imports (ugly, but it works)
2. Optional dependencies (might need conditional imports anyway, otherwise it might interfere with Kedro's discovery mechanism or give warnings)
3. Split the Kedro project (more below)

@datajoely is an advocate of (2). I think it solves the dependency issue, but it might still require (1), which is ugly (I haven't checked). It will definitely work though - notice that in recent versions, Kedro is fully compatible with all the Python Packaging standards.

In this ~~long monologue~~ thread I'm trying to figure out a way to do (3) such that instead of different Kedro projects, we still have 1 Kedro project + a leaner / more lightweight way of splitting the pipelines.

astrojuanlu · 2025-05-23T07:43:38Z

astrojuanlu
May 23, 2025
Maintainer Author

Another problem worth looking at while solving this is how to help teams separate the business logic from the Kedro logic even more. Some users put Kedro logic inside their node functions because they want to do dynamic or unconventional stuff. However, this makes their code more Kedro-dependent and less reusable.

1 reply

datajoely May 23, 2025
Collaborator

This is a good point, because I'd argue us autogenerating the nodes.py nudges people into this pattern of high coupling. In my opinion, declaring your 'flow' in pipeline.py is correct, but your functional logic should live outside of Kedro in independently well-tested packages.
