Conversation
> 2. Support for JSON values in batch operations (deferred to future work)
> 3. Automatic schema inference or generation
> 4. Support for non-JSON structured data formats (Protocol Buffers, XML, etc.)
- Support for querying within JSON objects. Query is by parent container name and meant for retrieval only
> "full_load_voltage": 11.95,
> "regulation_percent": 1.24
> },
> "ripple_analysis": {
this is a good example of where we'll really want to discourage use? Like... this is good for not losing data being logged hierarchically in TestStand but all the other clients have never logged this type of data. Other than customers writing their own viewers, how do we expect this information to be used? Are we going to build visualizers for any of this kind of data?
> **Approach:** Add `string json_string` to the `oneof` with a parallel `string json_schema_id` field.
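For readers skimming the thread, the approach under discussion could look roughly like the fragment below. The message name, field names, and field numbers here are illustrative assumptions, not the actual MDS `.proto` definitions:

```proto
// Hypothetical sketch of the proposed oneof addition (names and field
// numbers are assumptions, not the real MDS definitions).
message MeasurementValue {
  oneof value {
    double scalar = 1;
    // ...existing strongly-typed members (vector, waveform, etc.)...
    string json_string = 10;  // raw JSON payload
  }
  // Parallel field identifying the registered schema the payload claims to match.
  string json_schema_id = 11;
}
```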
> **Pros:**
So a pro that is missing here is that this solution works for LV. That's actually the biggest con missing from the `struct.proto` alternative: since `struct.proto` contains recursive type references, `grpc-labview` cannot successfully generate code for any proto files which import this type. The reason for this limitation is that by default LV types are all value types, so for the same reasons you cannot have recursive type references in C++/C# structs, you also cannot have them in LV.
So you either have to:
- Decide you do not care about LV APIs for this proposal. That means you might still end up having to keep a copy of the .proto file with JSON values stripped out of it so that `grpc-labview` can still generate code for the portions of the API which it can support.
- Invest in `grpc-labview` so that it:
  - Simply ignores any portions of an API which result in recursive type references. Perhaps we would want to create LV-specific options in the proto file to control this behavior.
  - Works natively with recursive type references. This would be a non-trivial effort and would presumably require the code generator to define and create DVRs (data value references) automatically as it encounters a recursive type reference cycle. I haven't given this as much thought as it needs, so it's possible that creating DVRs automatically on the fly isn't even practical from user code.
- Add custom support for `struct.proto` such that the C++ code parses the binary format for `struct.proto` and then converts it to a string data type at the LV API, where it can be consumed via JSON primitives and third-party LV add-on libraries.
> "description": "Comprehensive power supply test results including output characteristics, ripple analysis, and transient response",
> "type": "object",
> "properties": {
> "output_voltage": {
This is related but somewhat orthogonal to this proposal. I see that this example schema is using snake case which is common for Python and some web APIs. It's also the recommended style for field names in .proto files. However, I believe camel case is most common for json data and web APIs as a whole. To a large degree, this is outside of our control as the user's data will ultimately control this. However, are there any recommendations/guidelines that we provide around casing in regards to data? Is there any standardization we should strive for with NI authored data and examples?
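As a point of reference for the casing question, converting between the two styles is mechanical, so a guideline wouldn't lock anyone in. A minimal sketch (the function names are mine, not from the proposal):

```python
import re

def snake_to_camel(name: str) -> str:
    """Convert a snake_case field name to camelCase."""
    first, *rest = name.split("_")
    return first + "".join(part.title() for part in rest)

def camel_to_snake(name: str) -> str:
    """Convert a camelCase field name to snake_case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# e.g. snake_to_camel("full_load_voltage") -> "fullLoadVoltage"
```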
jasonmreding left a comment
Waiting on what the proposed plan is for LV APIs. If we don't plan to support LV, then that should be clearly stated in the proposal.
> The current measurement data infrastructure supports a fixed set of strongly-typed value types (scalar, vector, waveform, XY data, spectrum, etc.). Users across multiple domains — including NI TestStand sequences, Python-based test automation, and other MDS clients — often work with complex, hierarchical data structures that don't map cleanly to these predefined types.
> For example, TestStand customers commonly create arbitrary data structures in their code modules, and Python API users frequently work with nested dictionaries and domain-specific objects. When these users attempt to publish their data to Measurement Data Services (MDS), they face significant challenges because their existing data structures can't be represented directly.
Can you provide some examples of ways that customers do this that are clearly "measurement" data and not "metadata"? I'm struggling to come up with these use cases.
When we put data in measurements, we need to think about how the whole workflow will work with that data.
What should be the Nigel interactions with these types of measurements? How would we expect to visualize them? When would users need these instead of just having multiple measurements with different names?
There's a good example below in python around line 331. We see clusters like that come out of LabVIEW steps in TestStand pretty often. SystemLink Test Monitor accepts containers like that and serializes them into json blobs for storage.
How does SystemLink allow for viewing that data and what lessons about that should we be carrying into MDS?
SystemLink has a fundamental weakness in that it allows arbitrary nesting of objects (like this proposal). To get around this, the TestStand results processor (for SystemLink) standardizes the objects that SystemLink receives into a standard structure with one measurement value per measurement object. This is very similar to what MDS has today. This standardization allows SystemLink to visualize any measurement against any other measurement (or condition or metadata field). When data comes in with arbitrary structure, SystemLink is not able to handle the data as well and the manipulation and analysis of the data is left to the user using the APIs. This has been a serious sticking point to the adoption of SystemLink APIs outside of TestStand and FlexLogger.
Too much flexibility is a problem.
> SystemLink has a fundamental weakness in that it allows arbitrary nesting of objects (like this proposal). To get around this, the TestStand results processor (for SystemLink) standardizes the objects that SystemLink receives into a standard structure with one measurement value per measurement object. This is very similar to what MDS has today. This standardization allows SystemLink to visualize any measurement against any other measurement (or condition or metadata field). When data comes in with arbitrary structure, SystemLink is not able to handle the data as well and the manipulation and analysis of the data is left to the user using the APIs. This has been a serious sticking point to the adoption of SystemLink APIs outside of TestStand and FlexLogger.
To check my understanding, are you saying: "SystemLink prefers standard structure, but can receive unstructured data. If it's unstructured, there's not much we can do with it by default"?
FWIW, I agree with this. This is why the schema is critical - if we have a schema, we have a chance at doing more with the data than if it is just an unstructured blob. Also, we're able to verify/validate that the data matches the expectation of the consumer of the data (or designer of the test system).
However, requiring the schema incurs a significant burden - especially on customers with existing "container shaped" data.
These two things feel fundamentally at odds - especially in the case of TestStand - where a customer is not going to know the shape/schema for all their data. If TestStand automatically generates the schema, we're likely losing out on much of the desired validation.
I see value in an API like this that fully describes the shape and allowed values for "container shaped" data, but I'm not sure it offers much value to TestStand without significant UX work on the TestStand side.
As such, I'm not sure it's something we should be prioritizing right now. Creating an API like this was a request from the TestStand team - so we will need more iteration to close on how to make it useful to TestStand.
For the TestStand result processor example, if we recursively traversed the data value until we reached intrinsic types we recognize and created names corresponding to the data hierarchy (using period or underscore to concatenate the name of each parent field to the leaf field), would that satisfy the need? It kind of sounds like that is what SystemLink is doing. Or does generating names in this fashion create too much noise? It probably still doesn't work well for arrays of containers, but maybe that is less common?
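The traversal suggested above can be sketched in a few lines. The function name and the period separator are illustrative, taken from the suggestion rather than from any actual result-processor code, and as noted it does not handle arrays of containers:

```python
def flatten(container, prefix="", sep="."):
    """Recursively flatten a nested dict into leaf values whose names
    concatenate each parent field name onto the leaf field name."""
    flat = {}
    for key, value in container.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

# Using the load_regulation container from the proposal's example:
result = flatten({"load_regulation": {"no_load_voltage": 12.10,
                                      "full_load_voltage": 11.95,
                                      "regulation_percent": 1.24}})
# result["load_regulation.full_load_voltage"] == 11.95
```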
> However, requiring the schema incurs a significant burden - especially on customers with existing "container shaped" data.
It sounds like we are telling the user that if they want to use MDS, they either have to:
- Register a schema for any data they want to log that doesn't conform to our default data schema.
- Modify any existing measurement IP to produce data values that conform to the MDS schema.
And there is user push back to either of those approaches.
> It sounds like we are telling the user that if they want to use MDS, they either have to:
> - Register a schema for any data they want to log that doesn't conform to our default data schema.
> - Modify any existing measurement IP to produce data values that conform to the MDS schema.
>
> And there is user push back to either of those approaches.
Yes, I think that's the case. From the start, we've said that a core principle of mds is that we will be able to validate/verify the data that is published. We will not allow blobs, and we will not allow drift/mistakes in the incoming data. That's - to some extent - fundamentally at odds with a drop in, no work required, data logging solution. If you've got an existing system, you're going to have to do one of those two things to get data into MDS. I don't think that's an unreasonable stance, but it is problematic for TestStand.
> From the start, we've said that a core principle of mds is that we will be able to validate/verify the data that is published.
why isn't that an opt-in?
I am trying to imagine how this proposal helps the MDS TestStand Result Processor publish arbitrary measurements. In the current implementation we evaluate the data type before publishing the measurement. The property object predetermines type based on propertyObject.Type.ElementType.
For a non-trivial container which is not a known type (Boolean, String, Number, AnalogWaveform, DigitalWaveform), we have no idea what the underlying representation looks like.
Would we:
- Build a JSON schema and do adhoc registration of newly constructed schema?
- Require TestStand users to look at all their output parameter types and manually define JSON schema for the value type they intend to log?
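For reference, "register dynamically every container we encounter" amounts to inferring a schema from the runtime value. A naive sketch (not actual TestStand or MDS code) makes the concern concrete: a schema generated this way only records whatever shape happened to occur, so it enforces almost nothing about intent:

```python
def infer_schema(value):
    """Naively infer a JSON schema fragment from a runtime value."""
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return {"type": "boolean"}
    if isinstance(value, (int, float)):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if isinstance(value, list):
        # Assumes homogeneous arrays; an empty array tells us nothing.
        return {"type": "array", "items": infer_schema(value[0]) if value else {}}
    if isinstance(value, dict):
        return {"type": "object",
                "properties": {k: infer_schema(v) for k, v in value.items()}}
    return {}
```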
important point here. The thought behind TestStand container support is to capture the existing step output into MDS where containers are used (lots of legacy code). If the approach is "register dynamically every container we encounter in TestStand" then we will have undercut the purpose of schemas. I still like schemas ONLY for enforcing required shapes (as opt-in for customers) and not as a way to reject unknown data.
Yeah, if you are going to mandate the schema, then writing a generic logger is basically impossible unless you already have knowledge ahead of time of all possible data types that will be published. Even then, without "named types" like you get with an Any, knowing which schema to use with which data values is not really possible.
> ## Open Questions
> 1. **Schema versioning**: Should we support versioned schemas? If a schema evolves, should old measurements reference the old schema version?
I like the idea of versioned schemas. I would recommend adding a version number to the schema and data store. If a user saves an entity with version 1.0, then allow it to be deserialized as a 1.0 JSON object.
If a user updates the schema, they can then start saving with version 2.0 of the JSON schema.
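The suggestion above amounts to keying schema storage by (schema id, version) rather than by id alone, so old measurements keep resolving against the version they were written with. A minimal sketch, where the class and method names are assumptions rather than an actual MDS API:

```python
class SchemaRegistry:
    """Sketch of a registry keyed by (schema_id, version): entities saved
    against version 1.0 keep deserializing with 1.0 even after 2.0 exists."""

    def __init__(self):
        self._schemas = {}

    def register(self, schema_id, version, schema):
        key = (schema_id, version)
        if key in self._schemas:
            raise ValueError(f"{schema_id} v{version} is already registered")
        self._schemas[key] = schema

    def resolve(self, schema_id, version):
        """Look up the exact schema a stored measurement was written against."""
        return self._schemas[(schema_id, version)]
```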
> ### Power Supply Characterization Result
> `` ```json ``
This is an interesting example! One of the main values of having a schema is that it allows users to look at data in standard ways instead of being at the whim of the test author. In this example, I see several measurements, several conditions, and several pieces of metadata. (I'll comment on the individual ones and how I see them being classified.)
If we just allow arbitrary JSON like this, we put the onus back on the data consumer (AI or human) to extract data from here, re-normalize it into something they can analyze, and then process it. I think this is a step backwards for measurements.
an alternative to "we put all the onus on the data consumer" that's used internally is that the test writer also provides guidance for how to view the data: adding semantic descriptions, default views, etc. So it becomes schema for viewing data for people that don't understand the tests natively (read: downstream managers and engineers). The standardized tooling then allows for generic data discovery but also auto-discovers the views that were originally intended. This would help things like AI as much as people.
Such a system btw doesn't require the test writer modify the test code
> "description": "Comprehensive power supply test results including output characteristics, ripple analysis, and transient response",
> "type": "object",
> "properties": {
> "output_voltage": {
output_voltage is a scalar measurement. I would expect this to be pulled out as a single measurement under a step in the MDS schema.
> "type": "number",
> "description": "Measured DC output voltage in volts"
> },
> "output_current": {
output_current is also a measurement, it should also be pulled out as a single measurement in the MDS schema.
> "type": "number",
> "description": "Measured DC output current in amperes"
> },
> "load_regulation": {
load_regulation is a collection of three measurements that are correlated with each other. no_load_voltage, full_load_voltage, and regulation_percent are all scalar measurements that should be represented as such. The correlation between these measurements could be done by matching the conditions under which they were taken, or (more simply) by parenting them to the same test step. (An assumption we commonly make in SystemLink is that a common parent test step implies correlation of measurement values.)
right... part of this design document said that flattening could be an option, which also loses the hierarchical organization the test intended. At the very least you'd have to prefix all the children, because you could have two containers with the exact same leaf names.
I agree with many others that it would be better if we did not have to add this support, and it feels really easy to abuse once someone puts a general-purpose wrapper on MDS and publishes all data this way. If we have to support arbitrary value types, I think this is a good approach. I do worry that longer term we are going to face hurdles with this value type being present: we have no understanding of the value, and we need to make sure we can still do interesting things with it. Are there any additional fields we could require that would help with understanding? I don't have any off the top of my head, but I think it would be good to put some thought into it.
@csjall, @adamarnesen, @rfriedma, @jasonmreding - @ccifra's comment above captures succinctly what I've been trying to say in my other various comments. As Adam points out, I think MDS will be most successful if we have good adoption of a known set of types that directly map/relate to the measurement/condition domains. I think we already have relatively good coverage of that set of types (though I think we need to add better support for complex data types, and perhaps things like …).

If we think that we really need container-shaped data, then I think the hurdle of requiring a schema is appropriate. It's not the desired path, so a little bit of friction isn't a bad thing. Without a schema, we have no idea what's in the blob and no way to ensure consistency in the data. Given the requirement that our service needs to understand the types, shape, and names of the contained data, I think this API proposal is reasonable, and I don't see an alternative at the service level.

With an API like this, I think the tooling/work to support customers moves out to the client. For instance, I can imagine things like a "wizard" in TestStand that will look at container-shaped data you'd like to publish and offer to help you build and register a schema (or offer ways to flatten the data out). But that's all downstream from this API proposal, and is well supported by this proposal.
To expand a bit on @nick-beer's comment: another solution (other than requiring schemas) would be to introduce some kind of mapping document based on the type of the container. The mapping would tell the MDS API layer which fundamental fields to map each element of the container to. This would allow data consumers to minimize the amount of customization they have to understand while still allowing for flexibility in the original data schema. This approach would look similar to what DIAdem does today with DataPlugins: the plugin provides the mapping from a format that doesn't conform to the TDM model to the TDM model. It is allowed to be lossy, and the responsibility is the user's to update the plugin when they change their data format/structure.
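A DataPlugin-style mapping for the power-supply example might look like the sketch below. The mapping keys and the measurement/condition categories are hypothetical, and the lossiness is deliberate: fields absent from the mapping are simply dropped, as with DataPlugins:

```python
# Hypothetical mapping from flattened container field names to MDS concepts.
# Fields not listed here are intentionally dropped (lossy by design).
MAPPING = {
    "output_voltage": ("measurement", "output_voltage"),
    "load_regulation.regulation_percent": ("measurement", "regulation_percent"),
    "test_temperature": ("condition", "temperature"),
}

def apply_mapping(flat_container, mapping):
    """Project a flattened container onto measurement/condition buckets."""
    result = {"measurement": {}, "condition": {}}
    for source_name, (category, target_name) in mapping.items():
        if source_name in flat_container:
            result[category][target_name] = flat_container[source_name]
    return result
```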
By the way, I outright removed the work item tracking this due to lack of forward progress. We will drop some TestStand data on the ground until something like this is implemented.
What does this Pull Request accomplish?
Lays out the design for arbitrary/json values in MDS.
Why should this Pull Request be merged?
We need to add support for arbitrary data values in MDS. This document should help us to align on the design before we create the implementation PR.
What testing has been done?
N/A