Support for multidimensional arrays in Croissant #649

@pierrot0

There is currently no way to express that a field is a multidimensional array, for example a 4x4 matrix.

An example of a dataset with this need is MatrixCity (https://github.com/city-super/MatrixCity), whose data includes a rotation matrix field (distributed as JSON in this example):

        {
            "frame_index": 0,
            "rot_mat": [
                [
                    -0.009902680292725563,
                    0.0010966990375891328,
                    -0.0008568363264203072,
                    -590.0
                ],
                [
                    -0.0013917317846789956,
                    -0.0078034186735749245,
                    0.006096699275076389,
                    590.0
                ],
                [
                    -8.448758914703092e-10,
                    0.0061566149815917015,
                    0.007880106568336487,
                    200.0
                ],
                [
                    0.0,
                    0.0,
                    0.0,
                    1.0
                ]
            ],
            "euler": [
                0.6632251739501953,
                8.44875884808971e-08,
                -3.0019662380218506
            ]
        },

One possibility might be to use JSON schema to represent such an array:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "array",
  "items": {
    "type": "array",
    "items": {"type": "number"},
    "minItems": 4,
    "maxItems": 4
  },
  "minItems": 4,
  "maxItems": 4
}
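
To make concrete what that schema enforces, here is a minimal sketch in plain Python (standard library only, not a JSON Schema validator). The helper names `is_number` and `is_4x4_number_matrix` and the sample record values are illustrative assumptions, not part of Croissant or MatrixCity:

```python
import json
from numbers import Real


def is_number(value):
    # JSON Schema "number" accepts integers and floats, but not booleans.
    # In Python, bool is a subclass of int, so it has to be excluded explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def is_4x4_number_matrix(value):
    # Outer array: exactly 4 items (minItems = maxItems = 4).
    if not isinstance(value, list) or len(value) != 4:
        return False
    # Each inner array: exactly 4 numbers.
    return all(
        isinstance(row, list) and len(row) == 4 and all(is_number(x) for x in row)
        for row in value
    )


# Illustrative record with the same shape as the MatrixCity sample (values made up).
record = json.loads("""
{
  "frame_index": 0,
  "rot_mat": [[1.0, 0.0, 0.0, -590.0],
              [0.0, 1.0, 0.0, 590.0],
              [0.0, 0.0, 1.0, 200.0],
              [0.0, 0.0, 0.0, 1.0]],
  "euler": [0.66, 0.0, -3.0]
}
""")

print(is_4x4_number_matrix(record["rot_mat"]))
print(is_4x4_number_matrix(record["euler"]))
```

Even this fixed-shape case needs a nested check, which hints at how much a general JSON schema interpreter would have to handle.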

The benefit is that JSON schema is quite expressive, so it could cover complex cases, including arrays of different types (useful in multimodal prompts, for example).

The downside is that the range of possible schemas is very large, so there is a risk that some datasets would end up with a single field defined in Croissant, whose type is a complex object described entirely in JSON schema. It would also significantly increase implementation complexity.

A possible alternative might be to define our own Array dataType in the croissant namespace, similarly to cr:BoundingBox. For example, something like:

{
  "@type": "cr:Field",
  "@id": "recordsetName/rotation_matrix",
  "description": "The rotation matrix.",
  "dataType": "cr:Array",
  "dataTypeParams": {
    "dimensions": [4, 4],
    "dataType": "sc:Float"
  },
  "source": {
    "fileSet": { ... },
    "extract": {
      "jsonPath": "..."
    }
  }
}
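
For comparison, a rough sketch of how a loader might check a value against the proposed `dimensions` parameter. `cr:Array` and `dataTypeParams` are only the proposal above and do not exist in Croissant today; the function name `matches_array_type` and the mapping of `sc:Float` to Python `int`/`float` are assumptions made for illustration:

```python
def matches_array_type(value, dimensions, leaf_types=(int, float)):
    """Return True if `value` is a nested list with exactly the given shape.

    `dimensions` mirrors the hypothetical "dimensions" entry, e.g. [4, 4].
    `leaf_types` is an assumed Python mapping for the element dataType.
    """
    if not dimensions:
        # No dimensions left: we should be at a scalar leaf (bools excluded).
        return isinstance(value, leaf_types) and not isinstance(value, bool)
    if not isinstance(value, list) or len(value) != dimensions[0]:
        return False
    # Every item must match the remaining dimensions.
    return all(
        matches_array_type(item, dimensions[1:], leaf_types) for item in value
    )


# Hypothetical parameters, copied from the proposed field definition.
data_type_params = {"dimensions": [4, 4], "dataType": "sc:Float"}

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(matches_array_type(identity, data_type_params["dimensions"]))
```

The same function covers the `euler` field with `dimensions` set to `[3]`, which suggests that a single small dataType could handle the common fixed-shape cases without a full schema language.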

What do you folks think?
