[FEATURE] Handling compact data representations for all connectors with minimum configurations #22

Merged: Gatmatz merged 9 commits into main from compact-to-native on Oct 9, 2025
@Bilpapster Bilpapster commented Oct 5, 2025

Owner

Description

We can now offer data quality monitoring over data sources that send data in compact format. The user only needs to specify the names of the columns that contain the field names and the values; the rest of the conversion to native format is handled internally. This feature is expected to add significant value, especially in IoT data quality monitoring scenarios.
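
To make the idea concrete, here is a minimal sketch of what unpacking a compact record into native format could look like. It is plain Python for illustration only: the helper `compact_to_native`, its parameters `fields_column` and `values_column`, and the sample record are all hypothetical and are not Stream DaQ's actual API or implementation.

```python
from typing import Any


def compact_to_native(
    record: dict[str, Any],
    fields_column: str = "fields",
    values_column: str = "values",
) -> dict[str, Any]:
    """Unpack one compact record into a flat ("native") record.

    Hypothetical helper for illustration; not part of Stream DaQ.
    """
    names = record[fields_column]
    values = record[values_column]
    if len(names) != len(values):
        raise ValueError("field names and values must have the same length")
    # Keep every other column (e.g. a timestamp) untouched.
    native = {
        key: value
        for key, value in record.items()
        if key not in (fields_column, values_column)
    }
    # Turn each (field name, value) pair into its own column.
    native.update(zip(names, values))
    return native


# Assumed shape of a compact IoT reading: one column with the field names
# and one column with the corresponding values.
compact = {
    "timestamp": 1,
    "fields": ["temperature", "humidity", "pressure"],
    "values": [21.5, 40.2, 1013.1],
}
native = compact_to_native(compact)
```

After unpacking, each field (`temperature`, `humidity`, `pressure`) is an ordinary column, so it can be referenced individually in quality checks.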

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Testing

  • I have tested this change locally
  • I have added tests that prove my fix is effective or that my feature works
  • All new and existing tests pass

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation

Fixes #7

@Gatmatz Gatmatz left a comment

Collaborator

Overall, great work again, Vassilis. The examples are clear and to the point, and the code is very well documented. We should discuss the rest of my comments before merging.

Comment thread examples/compact_data.py
# Notice how we can directly reference individual fields (temperature, humidity, pressure)
# even though they arrive in compact format - Stream DaQ handles the unpacking automatically!
daq.add(dqm.count("pressure"), name="readings") \
.add(dqm.missing_count("temperature")

Collaborator

Very good example. However, we should resolve issue #21 before presenting this example.

Owner Author

Thanks for pointing that out! The fix for this bug is now merged.

Comment thread streamdaq/StreamDaQ.py
native_data_types = {
    column_names[i]: values_dtype
    for i in range(len(column_names))
}

Collaborator

I don't really like the assumption that every column has the same type. An input data stream can include a mix of strings, integers, and floats. Instead of passing a single type, we should consider using an array of types, one for each column.

Owner Author

You are right, George! I see your point here.

I had IoT scenarios in mind, where typically all readings are floats, but we should definitely support the general case as well. My only concern is that we should make this as easy as possible for the user, and an array of types might not do that. The user would have to make sure that the order of the types matches the order of the column names, which becomes more cumbersome as the number of fields grows. For example, in the IoT scenario we discuss on Fridays, we have 500 values in compact format (all float). Things become even more complicated if we decide to support schema evolution in the future.

I think we can find a sweet spot between ease of use and complete control/expressiveness on the user's end. Here is my proposal: we keep the values_dtype parameter as is, but we also add an optional parameter, exceptions. It can be a dictionary of the form column_name: type, which eliminates the need for the user to remember the exact order and stays compatible with schema evolution, if we decide to support it.

I am going to implement it in the revised PR. What do you think about it?

Collaborator

I completely agree with your approach. I really like this implementation idea.

Keeping values_dtype as the main parameter while adding an optional exceptions dictionary feels like a great balance between usability and flexibility. It keeps things simple for the common IoT case, while still allowing users to handle specific type overrides without worrying about column order or future schema changes.
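
The approach agreed on in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `build_native_data_types` is hypothetical, the variable names are borrowed from the excerpts quoted in this review, and the check for unknown column names is an added assumption rather than something the PR is known to do.

```python
from typing import Optional


def build_native_data_types(
    column_names: list[str],
    values_dtype: type,
    type_exceptions: Optional[dict[str, type]] = None,
) -> dict[str, type]:
    """Map every column to a type: one default plus per-column overrides.

    Hypothetical helper for illustration; not the merged implementation.
    """
    # Start by assigning the default type to every column.
    native_data_types = {name: values_dtype for name in column_names}
    # Overwrite the defaults with the user-specified exceptions, if any.
    if type_exceptions:
        unknown = set(type_exceptions) - set(column_names)
        if unknown:
            # Assumption: an override for a column that does not exist
            # is treated as an error in this sketch.
            raise ValueError(f"unknown columns in exceptions: {sorted(unknown)}")
        native_data_types.update(type_exceptions)
    return native_data_types


# Common IoT case: everything is float, except one string column.
types = build_native_data_types(
    ["temperature", "humidity", "status"],
    values_dtype=float,
    type_exceptions={"status": str},
)
```

Because the overrides are keyed by column name, the user never has to keep a list of types aligned with the column order, which matters when there are hundreds of fields.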

Comment thread streamdaq/StreamDaQ.py
for i in range(len(column_names))
}
# overwrite the default types with the user-specified exceptions, if any
if type_exceptions:

Owner Author

NEW to address the above comment

Comment thread streamdaq/CompactData.py
Args:
values_dtype (Type): the (default) data type of the
individual values.
exceptions (Optional[dict[str, Type]], optional): A mapping

Owner Author

NEW to address review comments

@Gatmatz Gatmatz merged commit 657dab7 into main Oct 9, 2025
3 of 4 checks passed
@Gatmatz Gatmatz deleted the compact-to-native branch October 9, 2025 12:52

Development

Successfully merging this pull request may close these issues.

[FEATURE] Compact data representation for MQTT input

2 participants