
Conversation

@gustavobtflores (Contributor) commented Nov 13, 2025

Objective

Implement hardware status denormalization to improve query performance for hardware listing data

Problem Solved

The current hardware listing endpoint suffers from performance issues due to complex joins across multiple tables (checkouts, tests, builds). This PR introduces a denormalized approach to aggregate hardware status data.

Key Changes

  • Added 4 new models (HardwareStatus, LatestCheckout, PendingTest, ProcessedHardwareStatus) with appropriate indexes
  • Integrated aggregation logic into the kcidbng_ingester to populate the checkouts and pending tables during data ingestion
  • Created process_pending_aggregations management command for batch processing of pending tests with configurable batch size and loop mode
  • Added get_hardware_listing_data_from_status_table function to query the denormalized data instead of using complex joins (see the sketch after this list)
  • Modified hardwareView to leverage the new denormalized table for improved response times, behind an optional "feature flag"
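For illustration only, here is a minimal sketch of what a query against the denormalized table can look like; the import path and the pre-aggregated counter column (pass_tests) are assumptions, not the actual implementation of get_hardware_listing_data_from_status_table:

from django.db.models import Count, Sum

from kernelCI_app.models import HardwareStatus  # assumed import path


def hardware_listing_sketch(start_time, end_time):
    """Aggregate the pre-computed hardware_status rows per (origin, platform)."""
    return (
        HardwareStatus.objects.filter(start_time__range=(start_time, end_time))
        .values("origin", "platform")  # grouping key
        .annotate(
            checkouts=Count("checkout_id", distinct=True),
            total_pass=Sum("pass_tests"),  # hypothetical pre-aggregated counter
        )
    )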

How to Test

1. Run Database Migration

cd backend
poetry run python manage.py migrate

2. Process Test Submissions (Optional - populate test data)

Extract submissions.zip to a folder and run the ingester:

poetry run python manage.py monitor_submissions --spool-dir ./{folder-name}

3. Test Backlog Processing Command

Single batch processing:

poetry run python manage.py process_pending_aggregations --batch-size 1000

Continuous loop mode:

poetry run python manage.py process_pending_aggregations --loop --interval 60 --batch-size 1000

4. Verify Hardware Status Data

Check that the hardware_status table is populated in PostgreSQL, using psql or any other client that can connect to the database.
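Alternatively, a quick sanity check from a Django shell (poetry run python manage.py shell); the models' import path below is an assumption:

from kernelCI_app.models import HardwareStatus, PendingTest  # assumed import path

print("hardware_status rows:", HardwareStatus.objects.count())
print("pending tests remaining:", PendingTest.objects.count())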

5. Test Hardware Listing Endpoint

Query the hardware listing endpoint and verify improved performance and correct data aggregation.

@gustavobtflores force-pushed the feat/hardware-denormalization branch from ae2fc62 to 0175d2c on November 17, 2025 at 21:16
origin = models.CharField(max_length=100)
platform = models.CharField(max_length=100)
compatibles = ArrayField(models.TextField(), null=True)
start_time = models.IntegerField()


since we don't have buckets anymore, keep this as a datetime field


class Meta:
    db_table = "hardware_status"
    unique_together = ("checkout_id", "origin", "platform")


I'd have origin, platform first so index lookups that only reach them will work.

also, we need another index on origin, platform, start_time... since we also query based on those.
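A hedged sketch of how the quoted model could look with the suggestions above applied (datetime start_time, origin/platform leading the unique constraint, and the extra index); the status counter columns are omitted and the checkout_id field type is assumed:

from django.contrib.postgres.fields import ArrayField
from django.db import models


class HardwareStatus(models.Model):
    origin = models.CharField(max_length=100)
    platform = models.CharField(max_length=100)
    compatibles = ArrayField(models.TextField(), null=True)
    start_time = models.DateTimeField()  # datetime instead of integer buckets
    checkout_id = models.TextField()  # assumed type

    class Meta:
        db_table = "hardware_status"
        # origin/platform first so lookups that only touch them can use the index
        unique_together = ("origin", "platform", "checkout_id")
        indexes = [
            # covers queries filtered by origin/platform and a start_time range
            models.Index(fields=["origin", "platform", "start_time"]),
        ]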



class LatestCheckout(models.Model):
    checkout_id = models.TextField(primary_key=True)


nit: group the unique together fields, then the other fields (checkout_id and start_time)


also not likely, but checkout_id is not a PK, the same commit can be in multiple trees...

Collaborator

checkout_id is not a PK, the same commit can be in multiple trees

checkout_id is always a single row in the checkouts table, so the same checkout_id can't come from multiple trees. We can, however, have multiple checkouts for the same tree, and that is the point of the LatestCheckout table: to only store the latest checkout for each tree
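For context, a sketch of the one-row-per-tree shape described above; the field names are assumed from the commit message ("track latest checkout per origin/tree/repository") and are not necessarily the actual model:

from django.db import models


class LatestCheckout(models.Model):
    origin = models.CharField(max_length=100)  # assumed field names/types
    tree_name = models.TextField()
    git_repository_url = models.TextField()
    checkout_id = models.TextField()  # latest checkout seen for this tree
    start_time = models.DateTimeField()

    class Meta:
        db_table = "latest_checkout"  # assumed table name
        unique_together = ("origin", "tree_name", "git_repository_url")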



class PendingBuild(models.Model):
    id = models.TextField(primary_key=True)


what's this id? i guess it's build_id... so name it accordingly to make it easier to correlate in the code.

I'm also wondering whether we have guarantees this won't conflict across trees/origins...



class PendingTest(models.Model):
    id = models.TextField(primary_key=True)


ditto, i guess it's test_id

id = models.TextField(primary_key=True)
checkout_id = models.TextField()
status = models.CharField(
    max_length=10, choices=StatusChoices.choices, blank=True, null=True


can this really be null or blank?

also, maybe we convert this into the subset we care about (pass/fail/inc). It doesn't need to be the same choices, we can use a reduced subset with single-letter DB storage (1 byte instead of a string)
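A sketch of what the reduced, single-character choice set could look like (the stored values and labels here are illustrative; the PR's actual SimplifiedStatusChoices may differ):

from django.db import models


class SimplifiedStatusChoices(models.TextChoices):
    # Single-character DB values instead of the full status strings.
    PASS = "P", "Pass"
    FAIL = "F", "Fail"
    INCONCLUSIVE = "I", "Inconclusive"


# On the model, the field would then shrink to:
# status = models.CharField(max_length=1, choices=SimplifiedStatusChoices.choices)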

compatible = ArrayField(models.TextField(), null=True)
build_id = models.TextField()
status = models.CharField(
    max_length=10, choices=StatusChoices.choices, blank=True, null=True


ditto

@MarceloRobert added the enhancement (New feature or request) and Database (Issue that alters only configs of a database itself) labels Nov 18, 2025
@gustavobtflores force-pushed the feat/hardware-denormalization branch 7 times, most recently from 547b86d to c36d84d on November 21, 2025 at 13:57
@gustavobtflores changed the title from "wip: hardware denormalization for hardware listing" to "feat: hardware denormalization for hardware listing" Nov 21, 2025
@gustavobtflores marked this pull request as ready for review on November 21, 2025 at 14:06
@gustavobtflores force-pushed the feat/hardware-denormalization branch from c36d84d to f428a90 on November 21, 2025 at 14:10
@gustavobtflores force-pushed the feat/hardware-denormalization branch 2 times, most recently from 63eb9f7 to aee6f18 on November 21, 2025 at 16:52
Comment on lines 38 to 45
valid_checkout_ids = set(
    LatestCheckout.objects.values_list("checkout_id", flat=True)
)

orphaned_entries = HardwareStatus.objects.exclude(
    checkout_id__in=valid_checkout_ids
)
orphaned_count = orphaned_entries.count()


this must be inside the transaction

)

total_deleted = 0
while True:


why?


ah, because of the slice below

"hardware_key",
"entity_id",
"entity_type",
blank=True,


blank=true?

Contributor Author

the default for a composite primary key is blank=True and apparently it cannot be changed to False; I haven't found out exactly why yet

https://github.com/django/django/blob/main/django/db/models/fields/composite.py#L53-L73


answering myself:

https://github.com/django/django/blob/stable/5.2.x/django/db/models/fields/composite.py#L50

        if not kwargs.setdefault("blank", True):
            raise ValueError("CompositePrimaryKey must be blank.")

Comment on lines 54 to 56
if status not in simplified_status_to_count:
    return simplified_status_to_count[SimplifiedStatusChoices.INCONCLUSIVE]
return simplified_status_to_count[status]


how, if all statuses are in there? But anyway, it's better to use .get(status, SimplifiedStatusChoices.INCONCLUSIVE) than to do a double lookup
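i.e., something like the following, keeping in mind the default still has to be the mapped counter rather than the enum member itself (a sketch based on the quoted lines):

return simplified_status_to_count.get(
    status, simplified_status_to_count[SimplifiedStatusChoices.INCONCLUSIVE]
)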

Comment on lines 125 to 127
is_already_processed = to_process in existing_processed

if is_already_processed:


why assign to variable?

Contributor Author

added just for readability, but yeah, to_process in existing_processed already sounds readable enough


while True:
    with transaction.atomic():
        batch_ids = list(
            orphaned_entries.values_list("checkout_id", flat=True)[:batch_size]


you can specify the values_list("checkout_id", flat=True) when you create orphaned_entries; here you would just slice it
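A sketch of that restructuring, with variable names adapted from the quoted code (the deletion itself is elided):

# Build the id-only queryset once; each loop iteration only slices it.
orphaned_checkout_ids = HardwareStatus.objects.exclude(
    checkout_id__in=valid_checkout_ids
).values_list("checkout_id", flat=True)

while True:
    with transaction.atomic():
        batch_ids = list(orphaned_checkout_ids[:batch_size])
        if not batch_ids:
            break
        # ...delete the HardwareStatus rows for batch_ids here...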

test, h_key, status_record, existing_processed, new_processed_entries
)

_process_build_status(


only process the build if the test was processed (return a boolean from _process_test_status and check it before this line)
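A sketch of the suggested control flow (the argument lists are taken from the surrounding quotes and partially assumed):

# _process_test_status returns True only when the test was actually processed.
test_was_processed = _process_test_status(
    test, h_key, status_record, existing_processed, new_processed_entries
)

if test_was_processed:
    _process_build_status(
        build, h_key, status_record, existing_processed, new_processed_entries
    )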

Comment on lines 212 to 213
build,
h_key,


note that if we include test.id, type=TEST in the hw_key above, we'll need to compute a new hw_key for the build, ok?

Collaborator

Please create an issue for adding unit tests for this and the other functions added in this PR just so that we don't forget




Comment on lines +81 to +86
if len(result.hardware) < 1:
    return create_api_error_response(
        error_message=ClientStrings.NO_HARDWARE_FOUND,
        status_code=HTTPStatus.OK,
    )
Collaborator

why don't you check for this len < 1 (btw, why not len == 0?) on hardwares_raw instead of letting them go through the model validation? If there is no hardware, I see that there won't be any item to validate, but it seems out of order to me

Contributor Author

just followed the current approach on hardwareView, but you're right, we can check it earlier

@gustavobtflores force-pushed the feat/hardware-denormalization branch from acd80d1 to eb0fbff on November 26, 2025 at 22:49
Comment on lines 98 to 99
h_key = get_hardware_key(test.origin, test.platform, checkout.id)
contexts.append((test, build, checkout, h_key))


when we move to including entity_id/type in get_hardware_key(), we must generate 2 h_keys, one for the test and another for the build, and report both in contexts and keys_to_check
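A sketch of what that could look like, assuming get_hardware_key() grows entity_id/entity_type parameters (see the sketch near the end of this review) and that single-character entity types are used; both assumptions are illustrative:

test_h_key = get_hardware_key(test.origin, test.platform, checkout.id, test.test_id, "T")
build_h_key = get_hardware_key(test.origin, test.platform, checkout.id, build.id, "B")
contexts.append((test, build, checkout, test_h_key, build_h_key))
keys_to_check.update((test_h_key, build_h_key))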


def _get_existing_processed(keys_to_check: set[bytes]) -> set[ProcessedHardwareStatus]:
    """Fetch existing processed entries from the database."""
    return set(ProcessedHardwareStatus.objects.filter(hardware_key__in=keys_to_check))


when we move hardware_key to include entity_id/type it will be a PK, so we can change this line as well and use .values_list("hardware_key", flat=True)

Comment on lines 118 to 127
to_process = ProcessedHardwareStatus(
hardware_key=h_key,
entity_id=test.test_id,
entity_type=HardwareStatusEntityType.TEST,
)


when we move hardware_key to include entity_id/type, then test_h_key is already what we'll check in existing_processed; no need to create the model instance here.

Comment on lines 160 to 171
to_process = ProcessedHardwareStatus(
    hardware_key=h_key,
    entity_id=build.id,
    entity_type=HardwareStatusEntityType.BUILD,
)

if to_process in existing_processed:
    return


when we move hardware_key to include entity_id/type, then build_h_key is already what we'll check in existing_processed; no need to create the model instance here.


existing_processed = _get_existing_processed(keys_to_check)

for test, build, checkout, h_key in contexts:


here you will get test_h_key and build_h_key, then be careful when calling _process_*_status()

def process_pending_batch(self, batch_size: int) -> int:
    last_processed_test_id = None

    while True:


i think we should hold a transaction inside the loop... or outside (more correct), but that may hold the lock for too long, need to check how it behaves with real data.

Contributor Author

i think it is safer to keep the transaction inside the loop; with that we don't lose time reprocessing the previous batches if one of them fails to commit
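A sketch of that per-batch transaction shape as a method; _process_one_batch is a hypothetical helper standing in for the existing batch logic:

from django.db import transaction


def process_pending_batch(self, batch_size: int) -> int:
    total_processed = 0
    while True:
        # One transaction per batch: if a batch fails to commit, the
        # previously committed batches are not lost or reprocessed.
        with transaction.atomic():
            processed = self._process_one_batch(batch_size)  # hypothetical helper
        if processed == 0:
            break
        total_processed += processed
    return total_processed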

    ready_tests: Sequence[PendingTest],
    ready_builds: dict[str, Builds],
) -> int:
    with transaction.atomic():


then this transaction is in process_pending_batch and not here

Comment on lines 321 to 332
pk = models.CompositePrimaryKey(
    "hardware_key",
    "entity_id",
    "entity_type",
)
hardware_key = models.BinaryField(
    max_length=32
)  # this holds a sha256, thus digest_size = 32 bytes
entity_id = models.TextField()
entity_type = models.CharField(
    max_length=1, choices=HardwareStatusEntityType.choices
)
@barbieri Nov 27, 2025

we should include entity_id/type inside the hardware_key. Currently the lookup for hardware_key will return all entities ever processed for that hardware; as the system grows and we never delete ProcessedHardwareStatus entries, the set will be large and this will hurt performance (and memory usage).

maybe we should keep a checkout_id field, and from a cron job remove the entries that are not related to the latest checkouts, same as we did with delete_unused_hardware_... (or even do it from inside that job)
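A sketch of that cleanup, assuming ProcessedHardwareStatus gains the suggested checkout_id column (the import path and function name are assumptions):

from django.db import transaction

from kernelCI_app.models import LatestCheckout, ProcessedHardwareStatus  # assumed path


def delete_stale_processed_entries() -> int:
    """Remove processed-entity rows whose checkout is no longer a latest checkout."""
    latest_ids = LatestCheckout.objects.values_list("checkout_id", flat=True)
    with transaction.atomic():
        deleted, _ = ProcessedHardwareStatus.objects.exclude(
            checkout_id__in=latest_ids  # requires the suggested checkout_id field
        ).delete()
    return deleted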

checkout_id: str
test_origin: str
platform: str
compatibles: Optional[str]


isn't this an array of strings?

Comment on lines 39 to 41
def get_hardware_key(origin: str, platform: str, checkout_id: str) -> bytes:
    """Generate a hash (hardware key) from origin, platform, and checkout ID."""
    return hashlib.sha256(f"{origin}|{platform}|{checkout_id}".encode("utf-8")).digest()


add entity_id/type
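A sketch of the extended helper, based on the quoted function; the parameter order and key layout are illustrative:

import hashlib


def get_hardware_key(
    origin: str, platform: str, checkout_id: str, entity_id: str, entity_type: str
) -> bytes:
    """Generate a hash (hardware key) that also distinguishes the processed entity."""
    return hashlib.sha256(
        f"{origin}|{platform}|{checkout_id}|{entity_type}|{entity_id}".encode("utf-8")
    ).digest()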

@gustavobtflores force-pushed the feat/hardware-denormalization branch from eb0fbff to c95ea60 on November 27, 2025 at 18:05
- Add HardwareStatus model to store aggregated test/build/boot status by platform
- Add LatestCheckout model to track latest checkout per origin/tree/repository
- Add PendingTest model to queue tests for aggregation processing
- Add ProcessedHardwareStatus model to track processed entities
- Add SimplifiedStatusChoices enum for aggregated status values
- Create database migration for new tables with appropriate indexes
- Add aggregation_helpers module with checkout and test aggregation logic
- Integrate aggregation into kcidbng_ingester db_worker
- Populate LatestCheckout table with checkout tracking
- Populate PendingTest table with test records for processing
- Run aggregations before buffer flush to ensure data consistency
- Add process_pending_aggregations management command for backlog processing
- Implement batch processing with configurable batch size and loop mode
- Add hardware status aggregation from PendingTest queue
- Add get_hardware_listing_data_from_status_table query function
- Query denormalized hardware_status table for improved performance
Collaborator

nit: rename migration to a more meaningful name
