[FEATURE] Handling compact data representations for all connectors with minimum configurations#22
Conversation
Gatmatz
left a comment
There was a problem hiding this comment.
Overall, great work again, Vassilis. The examples are clear and to the point, and the code is very well documented. We should discuss the rest of my comments before merging.
| # Notice how we can directly reference individual fields (temperature, humidity, pressure) | ||
| # even though they arrive in compact format - Stream DaQ handles the unpacking automatically! | ||
| daq.add(dqm.count("pressure"), name="readings") \ | ||
| .add(dqm.missing_count("temperature") |
There was a problem hiding this comment.
Very good example. However, we should resolve issue #21 first before presenting this example.
There was a problem hiding this comment.
Thanks for pointing out! The fix to this bug is now merged.
| native_data_types = { | ||
| column_names[i]: values_dtype | ||
| for i in range(len(column_names)) | ||
| } |
There was a problem hiding this comment.
I don't really like the hypothesis that every column will be the same type. An input data stream can include a mix of strings, integers, and floats. Instead of passing a single type, we should consider using an array of types—one for each column.
There was a problem hiding this comment.
You are right, George! I see your point here.
I had in mind IoT scenarios, where typically all readings are float, but we should definitely support the general case as well. My only concern is that we should make it as easy for the user as possible and an array of types might not be the case. The user would have to make sure that ordering of types and column names match the actual ones, which becomes more cumbersome the more the fields. For example, consider that in the IoT scenario we discuss on Fridays we have 500 values in compact format (all float). Things become even more complicated if we decide to support schema evolution in the future.
I think we can find a sweet spot between easiness of use and complete control/expressiveness on the user end. Here is my proposal: We keep the values_dtype parameter as is, but we also add another parameter exceptions (optional). exceptions can be a dictionary in the form column_name: type, eliminating the need for the user to memorize the exact order and ensuring compatibility with schema evolution, if we decide so.
I am going to implement it in the revised PR. What do you think about it?
There was a problem hiding this comment.
I completely agree with your approach — I really like this idea of implementation.
Keeping values_dtype as the main parameter while adding an optional exceptions dictionary feels like a great balance between usability and flexibility. It keeps things simple for the common IoT case, while still allowing users to handle specific type overrides without worrying about column order or future schema changes.
| for i in range(len(column_names)) | ||
| } | ||
| # overwrite the default types with the user-specified exceptions, if any | ||
| if type_exceptions: |
There was a problem hiding this comment.
NEW to address the above comment
| Args: | ||
| values_dtype (Type): the (default) data type of the | ||
| individual values. | ||
| exceptions (Optional[dict[str, Type]], optional): A mapping |
There was a problem hiding this comment.
NEW to address review comments
Description
We can now offer data quality monitoring over data sources sending data in compact format. The user needs to define only the column names that contain the fields and values, while the rest of the conversion to native format is abstracted from the user. This feature is expected to have significant added value especially in IoT data quality monitoring scenarios.
Type of Change
Testing
Checklist
Fixes #7