Conversation

@treff7es
Contributor

No description provided.

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata product PR or Issue related to the DataHub UI/UX devops PR or Issue related to DataHub backend & deployment labels Dec 19, 2025

codecov bot commented Dec 19, 2025

❌ 3 Tests Failed:

Tests completed: 12579 | Failed: 3 | Passed: 12576 | Skipped: 87
View the top 3 failed test(s) by shortest run time
tests.unit.test_unity_catalog_source.TestUnityCatalogSource::test_process_ml_model_generates_workunits
Stack Traces | 0.004s run time
self = <tests.unit.test_unity_catalog_source.TestUnityCatalogSource object at 0x7f0c21722a30>
mock_hive_proxy = <MagicMock name='HiveMetastoreProxy' id='139688225449008'>
mock_unity_proxy = <MagicMock name='UnityCatalogApiProxy' id='139688241344864'>

    @patch("datahub.ingestion.source.unity.source.UnityCatalogApiProxy")
    @patch("datahub.ingestion.source.unity.source.HiveMetastoreProxy")
    def test_process_ml_model_generates_workunits(
        self, mock_hive_proxy, mock_unity_proxy
    ):
        """Test that process_ml_model generates proper workunits."""
        from datetime import datetime
    
        from datahub.ingestion.api.common import PipelineContext
        from datahub.ingestion.source.unity.proxy_types import (
            Catalog,
            Metastore,
            Model,
            ModelVersion,
            Schema,
        )
    
        config = UnityCatalogSourceConfig.model_validate(
            {
                "token": "test_token",
                "workspace_url": "https://test.databricks.com",
                "warehouse_id": "test_warehouse",
                "include_hive_metastore": False,
            }
        )
    
        ctx = PipelineContext(run_id="test_run")
        source = UnityCatalogSource.create(config, ctx)
    
        # Create test schema
        metastore = Metastore(
            id="metastore",
            name="metastore",
            comment=None,
            global_metastore_id=None,
            metastore_id=None,
            owner=None,
            region=None,
            cloud=None,
        )
        catalog = Catalog(
            id="test_catalog",
            name="test_catalog",
            metastore=metastore,
            comment=None,
            owner=None,
            type=None,
        )
        schema = Schema(
            id="test_catalog.test_schema",
            name="test_schema",
            catalog=catalog,
            comment=None,
            owner=None,
        )
    
        # Create test model
        test_model = Model(
            id="test_catalog.test_schema.test_model",
            name="test_model",
            description="Test description",
            schema_name="test_schema",
            catalog_name="test_catalog",
            created_at=datetime(2023, 1, 1),
            updated_at=datetime(2023, 1, 2),
        )
    
        # Create test model version
        test_model_version = ModelVersion(
            id="test_catalog.test_schema.test_model_1",
            name="test_model_1",
            model=test_model,
            version="1",
            aliases=["prod"],
            description="Version 1",
            created_at=datetime(2023, 1, 3),
            updated_at=datetime(2023, 1, 4),
            created_by="test_user",
            run_details=None,
            signature=None,
        )
    
        # Process the model
        ml_model_workunits = list(source.process_ml_model(test_model, schema))
    
        # Should generate workunits (MLModelGroup creation and container assignment)
        assert len(ml_model_workunits) > 0
    
        assert len(source.report.ml_models.processed_entities) == 1
>       assert (
            source.report.ml_models.processed_entities[0][1]
            == "test_catalog.test_schema.test_model"
        )
E       AssertionError: assert 'e' == 'test_catalog.test_schema.test_model'
E         
E         - test_catalog.test_schema.test_model
E         + e

tests/unit/test_unity_catalog_source.py:550: AssertionError
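The stray `'e'` in the diff above is consistent with `processed_entities` holding bare id strings rather than tuples: indexing `[0][1]` then returns the second character of `"test_catalog.test_schema.test_model"`, which is `e`. A minimal illustration of the shape mismatch (the tuple layout shown is an assumption, not from the CI output):

```python
# If the report stores the entity id as a bare string, [0][1] picks a character:
processed_entities = ["test_catalog.test_schema.test_model"]
entry = processed_entities[0]
assert entry[1] == "e"  # second character of the id string -- the observed 'e'

# The test expects tuple entries, e.g. (timestamp, entity_id) -- hypothetical layout:
processed_entities = [("2023-01-01", "test_catalog.test_schema.test_model")]
assert processed_entities[0][1] == "test_catalog.test_schema.test_model"
```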
tests.integration.mode.test_mode::test_mode_ingest_failure
Stack Traces | 0.115s run time
pytestconfig = <_pytest.config.Config object at 0x7f0911699450>
tmp_path = PosixPath('.../pytest-of-runner/pytest-0/test_mode_ingest_failure0')

    @freeze_time(FROZEN_TIME)
    def test_mode_ingest_failure(pytestconfig, tmp_path):
        with patch(
            "datahub.ingestion.source.mode.requests.Session",
            side_effect=mocked_requests_failure,
        ):
            global test_resources_dir
            test_resources_dir = pytestconfig.rootpath / "tests/integration/mode"
    
            pipeline = Pipeline.create(
                {
                    "run_id": "mode-test",
                    "source": {
                        "type": "mode",
                        "config": {
                            "token": "xxxx",
                            "password": "xxxx",
                            "connect_uri": "https://app.mode.com/",
                            "workspace": "acryl",
                        },
                    },
                    "sink": {
                        "type": "file",
                        "config": {
                            "filename": f"{tmp_path}/mode_mces.json",
                        },
                    },
                }
            )
            pipeline.run()
            with pytest.raises(PipelineExecutionError) as exec_error:
                pipeline.raise_from_status()
            assert exec_error.value.args[0] == "Source reported errors"
            assert len(exec_error.value.args[1]) == 1
            error_dict: StructuredLogEntry
>           _level, error_dict = exec_error.value.args[1][0]
            ^^^^^^^^^^^^^^^^^^
E           TypeError: cannot unpack non-iterable StructuredLogEntry object

.../integration/mode/test_mode.py:209: TypeError
tests.integration.mode.test_mode::test_mode_ingest_json_failure
Stack Traces | 0.197s run time
pytestconfig = <_pytest.config.Config object at 0x7f0911699450>
tmp_path = PosixPath('.../pytest-of-runner/pytest-0/test_mode_ingest_json_failure0')

    @freeze_time(FROZEN_TIME)
    def test_mode_ingest_json_failure(pytestconfig, tmp_path):
        with patch(
            "datahub.ingestion.source.mode.requests.Session",
            side_effect=lambda *args, **kwargs: MockResponseJson(
                json_error_list=["https://app.mode.com/api/modeuser"]
            ),
        ):
            global test_resources_dir
            test_resources_dir = pytestconfig.rootpath / "tests/integration/mode"
    
            pipeline = Pipeline.create(
                {
                    "run_id": "mode-test",
                    "source": {
                        "type": "mode",
                        "config": {
                            "token": "xxxx",
                            "password": "xxxx",
                            "connect_uri": "https://app.mode.com/",
                            "workspace": "acryl",
                        },
                    },
                    "sink": {
                        "type": "file",
                        "config": {
                            "filename": f"{tmp_path}/mode_mces.json",
                        },
                    },
                }
            )
            pipeline.run()
            pipeline.raise_from_status(raise_warnings=False)
            with pytest.raises(PipelineExecutionError) as exec_error:
                pipeline.raise_from_status(raise_warnings=True)
            assert len(exec_error.value.args[1]) > 0
            error_dict: StructuredLogEntry
>           _level, error_dict = exec_error.value.args[1][0]
            ^^^^^^^^^^^^^^^^^^
E           TypeError: cannot unpack non-iterable StructuredLogEntry object

.../integration/mode/test_mode.py:287: TypeError
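Both mode failures above come from the same unpacking: the test expects each failure in `args[1]` to be a `(level, entry)` tuple, but the pipeline now reports bare `StructuredLogEntry` objects. A minimal sketch of the mismatch, using a stand-in dataclass rather than DataHub's real class:

```python
from dataclasses import dataclass

@dataclass
class StructuredLogEntry:  # illustrative stand-in for DataHub's class
    title: str
    message: str

failures = [StructuredLogEntry("modeuser", "API call failed")]

# Old expectation -- fails, a dataclass instance is not iterable:
try:
    _level, error_dict = failures[0]
except TypeError as e:
    print(e)  # cannot unpack non-iterable StructuredLogEntry object

# One way a test could accept both the old tuple shape and the new bare entry:
item = failures[0]
error_dict = item[1] if isinstance(item, tuple) else item
```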

To view more test analytics, go to the Test Analytics Dashboard


alwaysmeticulous bot commented Dec 19, 2025

✅ Meticulous spotted 0 visual differences across 967 screens tested: view results.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit 87527a6. This comment will update as new commits are pushed.


codecov bot commented Dec 19, 2025

Bundle Report

Bundle size has no change ✅

…ssors/standard_schema_processor.py

Co-authored-by: aikido-pr-checks[bot] <169896070+aikido-pr-checks[bot]@users.noreply.github.com>
try:
    if not url.startswith(("http://", "https://")):
        raise ValueError("Invalid URL scheme")
    response = requests.get(url, timeout=10)
@aikido-pr-checks bot (Contributor) commented Dec 19, 2025


Potential user input in HTTP request may allow SSRF attack - high severity
If an attacker can control the URL input to this HTTP request, they may be able to perform an SSRF attack. This kind of attack is even more dangerous if the application returns the response of the request to the user: it could let the attacker retrieve information from higher-privileged services within the network (such as the metadata service, which is commonly available in cloud environments and could expose credentials).

Remediation: If possible, only allow requests to allowlisted domains. If not, consult the article linked above to learn about other mitigating techniques, such as disabling redirects, blocking private IPs, and making sure private services have internal authentication. If you return data from the request to the user, validate it first to make sure you don't return arbitrary data.
View details in Aikido Security
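A minimal sketch of the allowlist approach suggested in the remediation. The host set and helper name are illustrative, not from this PR; exact hostname matching is used because prefix checks alone can be bypassed with lookalike domains:

```python
from urllib.parse import urlparse

# Illustrative allowlist -- not DataHub configuration.
ALLOWED_HOSTS = {"app.mode.com", "api.example.com"}

def is_allowed(url: str) -> bool:
    parsed = urlparse(url)
    # Require an explicit http(s) scheme and an exact hostname match;
    # exact matching blocks tricks like "app.mode.com.evil.net".
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_HOSTS

print(is_allowed("https://app.mode.com/api/runs"))        # True
print(is_allowed("https://app.mode.com.evil.net/steal"))  # False
print(is_allowed("file:///etc/passwd"))                   # False
```

Note that `url.startswith(("http://", "https://"))` in the snippet under review checks only the scheme, not the destination host, so it does not by itself prevent SSRF.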

@github-actions github-actions bot requested a deployment to datahub-project-web-react (Preview) December 19, 2025 16:18 Abandoned
2 participants