Collaborator

@MarceloRobert MarceloRobert commented Nov 27, 2025

Description

Changes the insertion policy of the ingester from "never update" to "only update null fields". This follows the policy adopted by kcidb, including a PRIO_DB env var which tells which data to prioritize.

If PRIO_DB is set to True, only null fields can be updated: null -> other value is allowed, but existing value -> other value and existing value -> null are not.
If set to False, existing data can be overwritten as long as the new data is not null: existing value -> other value and null -> other value are allowed, but existing value -> null is not.
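
To make the two modes concrete, here is a minimal sketch of the per-field outcome in plain Python. The function name `resolve_field` is hypothetical and is not part of this PR; it only mirrors the rules described above, with `None` standing in for a null column.

```python
from typing import Any, Optional


def resolve_field(existing: Optional[Any], incoming: Optional[Any], prio_db: bool) -> Optional[Any]:
    """Return the value a column would hold after an upsert (hypothetical helper)."""
    if prio_db:
        # Database has priority: an incoming value is only used to fill a null.
        return existing if existing is not None else incoming
    # Incoming data has priority, but a null never erases a stored value.
    return incoming if incoming is not None else existing
```

In both modes an existing value is never replaced by null; the modes only differ when both the stored and the incoming value are non-null.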

Changes

  • Added a new function to get the insertion query, in SQL, for any model
  • Used this function to get the query on consume_buffer instead of Django's bulk_create
  • Updated unit tests

How to test

Start the ingester and insert some data into it that references the same object, then check whether the policy is respected.
Example realistic data for testing can be found here, but other tests are encouraged.

Future changes

  1. Generating the insertion query dynamically means we don't have to change it every time the models change, and we don't need a separate piece of code for each model. However, recreating the same query on every call might be costly, so a static approach might be better. The dynamic generation could live in a command so that we can refresh the insertion queries when we want to;
  2. Adding automated integration tests would also be useful, so that we can guarantee this policy is kept regardless of changes in the ingester.

Closes #1552

@MarceloRobert MarceloRobert self-assigned this Nov 27, 2025
@MarceloRobert MarceloRobert added the Database Issue that alters only configs of a database itself label Nov 27, 2025
@MarceloRobert MarceloRobert force-pushed the feat/ingester-django-tests branch 2 times, most recently from 2e861d1 to f761ab5 Compare November 27, 2025 18:00
@MarceloRobert MarceloRobert changed the title WIP: Feat/ingester django tests Feat: change ingester data insertion policy Nov 27, 2025
@MarceloRobert MarceloRobert marked this pull request as ready for review November 27, 2025 18:11
@MarceloRobert MarceloRobert force-pushed the feat/ingester-django-tests branch from f761ab5 to 8f5cafe Compare November 27, 2025 18:19
Copilot finished reviewing on behalf of MarceloRobert November 27, 2025 18:38

Copilot AI left a comment

Pull request overview

This PR changes the ingester's data insertion policy from "never update" to "only update null fields", controlled by a new PRIO_DB environment variable. The implementation replaces Django's ORM bulk_create with raw SQL queries using PostgreSQL's ON CONFLICT clause with COALESCE operations to selectively update fields based on the priority policy.

Key changes:

  • Added dynamic SQL query generation function _generate_model_insert_query that creates upsert queries with field-level update control
  • Replaced bulk_create with raw SQL execution using cursor.executemany() in consume_buffer
  • Introduced PRIO_DB environment variable to toggle between database-priority and data-priority update modes
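
As an illustration of the ON CONFLICT plus COALESCE approach summarized above, the following is a rough sketch of how such an upsert statement could be assembled. It is not the PR's `_generate_model_insert_query`; the function name `build_upsert_query`, its parameters, and the assumption that `id` is the conflict column are all hypothetical.

```python
from typing import Sequence


def build_upsert_query(
    table: str,
    columns: Sequence[str],
    conflict_column: str = "id",
    prio_db: bool = True,
) -> str:
    """Build a PostgreSQL upsert string (hypothetical sketch, not the PR's code).

    Identifiers are interpolated directly, so they must come from trusted
    sources such as model metadata, never from user input.
    """
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    updatable = [c for c in columns if c != conflict_column]
    insert = f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"
    if not updatable:
        # Nothing besides the key to update, so conflicts are simply skipped.
        return f"{insert} ON CONFLICT ({conflict_column}) DO NOTHING"
    if prio_db:
        # Keep the stored value unless it is null.
        template = "{col} = COALESCE({table}.{col}, EXCLUDED.{col})"
    else:
        # Take the incoming value unless it is null.
        template = "{col} = COALESCE(EXCLUDED.{col}, {table}.{col})"
    assignments = ", ".join(template.format(col=c, table=table) for c in updatable)
    return f"{insert} ON CONFLICT ({conflict_column}) DO UPDATE SET {assignments}"
```

The only difference between the two modes is the argument order inside COALESCE, which returns its first non-null argument.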

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| backend/kernelCI_app/constants/ingester.py | Adds PRIO_DB constant with documentation (contains typo in env var name) |
| backend/kernelCI_app/models.py | Adds module docstring noting explicit id column requirement for ingester |
| backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py | Implements dynamic SQL query generation and replaces ORM with raw SQL for data insertion |
| backend/kernelCI_app/tests/unitTests/commands/monitorSubmissions/kcidbng_ingester_test.py | Updates test to mock database connections instead of Django ORM methods |

out(f"Unknown table '{table_name}' passed to consume_buffer")
raise

updateable_model_fields, query = _generate_model_insert_query(table_name, model)

Copilot AI Nov 27, 2025

The _generate_model_insert_query function is called on every consume_buffer invocation, which means the same query is regenerated repeatedly for the same table. As noted in the PR description's "Future changes" section, consider caching the generated queries per table to avoid this overhead. The query structure is static for each model and only depends on PRIO_DB, which is set at startup.
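
One possible way to cache per table, sketched with `functools.lru_cache`. The names `generate_insert_query` and `get_insert_query` are hypothetical stand-ins, and the sketch assumes the generator's arguments are hashable and that PRIO_DB does not change after startup.

```python
from functools import lru_cache

# Counts how often the (stand-in) generator actually runs.
CALLS = {"count": 0}


def generate_insert_query(table_name: str) -> str:
    """Stand-in for the real, more expensive query generator (hypothetical)."""
    CALLS["count"] += 1
    return f"INSERT INTO {table_name} (id) VALUES (%s) ON CONFLICT (id) DO NOTHING"


@lru_cache(maxsize=None)
def get_insert_query(table_name: str) -> str:
    """Return the query for a table, generating it only on the first request."""
    return generate_insert_query(table_name)
```

With this shape, repeated calls for the same table reuse the stored string, and `get_insert_query.cache_clear()` can be used if the queries ever need to be regenerated.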

Comment on lines +194 to +200
if isinstance(value, (dict, list)):
    value = json.dumps(value)

Copilot AI Nov 27, 2025

The type check isinstance(value, (dict, list)) only handles JSON serialization for dicts and lists, but doesn't handle other types that might need special handling (e.g., datetime objects, Decimal, custom objects). Consider adding more comprehensive type handling or documenting the assumption that only dicts/lists need JSON serialization.
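
A sketch of what broader handling might look like, assuming the concern is datetime or Decimal values nested inside a dict or list. The helper names `_json_default` and `prepare_value` are hypothetical. Top-level datetime and Decimal values are typically adapted by the database driver already, so they are passed through unchanged here.

```python
import json
from datetime import date, datetime
from decimal import Decimal
from typing import Any


def _json_default(value: Any) -> Any:
    """Fallback used by json.dumps for types it cannot encode natively."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, Decimal):
        # A string keeps the exact decimal representation.
        return str(value)
    raise TypeError(f"Cannot serialize {type(value).__name__} to JSON")


def prepare_value(value: Any) -> Any:
    """Serialize dicts and lists to JSON text; leave other values untouched."""
    if isinstance(value, (dict, list)):
        return json.dumps(value, default=_json_default)
    return value
```

Raising TypeError for unknown types keeps unexpected objects from being silently stored in a lossy form.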

Now allows for overwrites of null data

Closes kernelci#1552
@MarceloRobert MarceloRobert force-pushed the feat/ingester-django-tests branch from 8f5cafe to 8c15328 Compare November 27, 2025 19:15
INGEST_QUEUE_MAXSIZE = 5000


PRIO_DB = is_boolean_or_string_true(os.environ.get("PRIO_DB", "True"))

@tales-aparecida tales-aparecida Nov 27, 2025

I think adding PRIO_DB is bringing a lot of complexity to this pull request. If we decide to change the policy again in the future, we can send another PR.
I believe we can assume you aren't going to need it, though.

mock_file_open.assert_called_once()


class TestGenerateInsertQuery:

@tales-aparecida tales-aparecida Nov 27, 2025

wow, I never read this file before, it's quite high-level and scary.

Do you think you could add a test case achieving something close to:

input_data = json.loads(Path("fixtures/inputs/kcidb.json").read_text())
expected_data = json.loads(Path("fixtures/outputs/tables.json").read_text())

with self.assertNumQueries(5):
    ingest(input_data)

# Materialize the querysets as lists so they can be compared with the JSON fixture.
result_data = {
  "checkouts": list(Checkout.objects.all().values()),
  "builds": list(Build.objects.all().values()),
  "tests": list(Test.objects.all().values()),
  "issues": list(Issue.objects.all().values()),
  "incidents": list(Incident.objects.all().values()),
}
self.assertEqual(result_data, expected_data)

Labels

Database Issue that alters only configs of a database itself

Development

Successfully merging this pull request may close these issues.

Verify if django-ingester is respecting kcidb limitations on fields update

2 participants