Commit cd2cc2e

Merge branch 'main' into manifest
2 parents e385a17 + 41dc420 commit cd2cc2e

3 files changed: +12 -8 lines changed

docs/croissant-spec-draft.md

Lines changed: 7 additions & 8 deletions
@@ -19,7 +19,7 @@ Authors:
 - Michael Kuchnik (Meta),
 - Jos van der Velde (OpenML),
 - Joaquin Vanschoren (OpenML),
-- Luis Oala (Dotphoton),
+- Luis Oala (brickroad.network),
 - Steffen Vogler (Bayer),
 - Mubashara Akthar (King’s College London),
 - Nitisha Jain (King’s College London),
@@ -318,7 +318,7 @@ In the rest of this document, we only describe the actual JSON-LD of Croissant m
 
 Croissant builds on the [schema.org/Dataset](http://schema.org/Dataset) vocabulary, which is widely adopted by datasets on the web. An introduction to describing datasets with this vocabulary can be found [here](https://developers.google.com/search/docs/appearance/structured-data/dataset).
 
-[Schema.org](http://Schema.org) properties are known to be very flexible in terms of the types of values they accept. We list below the main properties of the vocabulary and their expected type.To facilitate more consistent use of these properties we provide additional constraints on their usage in the context of Croissant datasets. We also specify cardinalities to clarify if a property can take one or many values.
+[Schema.org](http://Schema.org) properties are known to be very flexible in terms of the types of values they accept. We list below the main properties of the vocabulary and their expected type. To facilitate more consistent use of these properties we provide additional constraints on their usage in the context of Croissant datasets. We also specify cardinalities to clarify if a property can take one or many values.
 
 We organize [schema.org](http://schema.org) properties in three categories: Required, recommended and other properties. The properties starting with the symbol `@` are defined in JSON-LD, which is our RDF syntax of choice for Croissant.
 
@@ -927,11 +927,10 @@ A `Field` is part of a `RecordSet`. It may represent a column of a table, or a n
 </tr>
 <tr>
 <td>value</td>
-<td>JSON</a></td>>
+<td>JSON</td>>
 <td>ONE</td>
 <td>An optional constant value for the field. Fields with values can be used to attach key/value pairs to a RecordSet. The value of a field can be atomic, for fields with a simple dataType, or it can be structured, e.g., if the field has subfields. For the latter case, a JSON string can be used to represent the value.</td>
 </tr>
-<tr>
 <tr>
 <td>isArray</td>
 <td><a href="http://schema.org/Boolean">Boolean</a></td>
@@ -1278,7 +1277,7 @@ Other data types commonly used in ML datasets:
 <td>cr:BoundingBox</td>
 <td>Describes the coordinates of a bounding box (4-number array). Refer to the section "ML-specific features > Bounding boxes".</td>
 </tr>
-<tr>
+<tr>
 <td><a href="https://schema.org/VideoObject">sc:VideoObject</a></td>
 <td>Describes a field containing the content of a video file.</td>
 </tr>
@@ -1480,7 +1479,7 @@ Annotations can also appear at the level of a RecordSet. A RecordSet-level annot
 { "@type": "cr:Field", "@id": "movies/title", ...},
 { "@type": "cr:Field", "@id": "movies/genre", ...}
 ],
-"annotation" : {
+"annotation": {
 "@type": "cr:Field", "@id": "movies/ratings",
 "subField": [
 { "@type": "cr:Field", "@id": "movies/ratings/user_id", ...},
@@ -1604,8 +1603,8 @@ When a field is mapped to a property, it can inherit the range type of that prop
 
 The following example shows a `RecordSet` where each record represents a city, typed as both a `wd:Q515` (Wikidata City) and `sc:GeoCoordinates`. The fields of the `RecordSet` are mapped to the properties of these classes, using both explicit and implicit mapping:
 - The `cities/name` field corresponds to the `sc:name` property via implicit mapping
-- The `citites/population` and `cities/country` fields are mapped to `wdt:P1082` and `wdt:P17` explicitly
-- The `cities/latitude` and `cities/longitude` fiels implicitly map to `sc:latitude` and `sc:longitude`.
+- The `cities/population` and `cities/country` fields are mapped to `wdt:P1082` and `wdt:P17` explicitly
+- The `cities/latitude` and `cities/longitude` fields implicitly map to `sc:latitude` and `sc:longitude`.
 
 ```json
 {

python/mlcroissant/mlcroissant/_src/structure_graph/nodes/file_object.py

Lines changed: 2 additions & 0 deletions
@@ -123,6 +123,8 @@ class FileObject(Node):
     def __post_init__(self):
         """Checks arguments of the node."""
         Node.__post_init__(self)
+        if self.contained_in_v1_1:
+            self.contained_in = self.contained_in_v1_1
         self.validate_name()
         uuid_field = "name" if self.ctx.is_v0() else "id"
         self.assert_has_mandatory_properties("encoding_formats", uuid_field)

python/mlcroissant/mlcroissant/_src/structure_graph/nodes/file_set.py

Lines changed: 3 additions & 0 deletions
@@ -4,6 +4,7 @@
 
 from mlcroissant._src.core import constants
 from mlcroissant._src.core import dataclasses as mlc_dataclasses
+from mlcroissant._src.core.context import CroissantVersion
 from mlcroissant._src.core.uuid import formatted_uuid_to_json
 from mlcroissant._src.structure_graph.base_node import Node
 from mlcroissant._src.structure_graph.nodes.file_object import _contained_in_from_jsonld
@@ -86,6 +87,8 @@ class FileSet(Node):
     def __post_init__(self):
         """Checks arguments of the node."""
         Node.__post_init__(self)
+        if self.contained_in_v1_1:
+            self.contained_in = self.contained_in_v1_1
         uuid_field = "name" if self.ctx.is_v0() else "id"
         self.validate_name()
         self.assert_has_mandatory_properties("includes", "encoding_formats", uuid_field)
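
The two `__post_init__` hunks above add the same fallback: when `contained_in_v1_1` is set, its value is copied into `contained_in`. The sketch below illustrates that pattern in isolation. `FileNode` is a hypothetical stand-in and is not the mlcroissant `FileObject` or `FileSet` class, and the list-of-strings field types are an assumption made only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FileNode:
    """Hypothetical stand-in for FileObject/FileSet (not the mlcroissant class)."""

    name: str
    # Field types are assumed for illustration; the real classes may differ.
    contained_in: list[str] = field(default_factory=list)
    contained_in_v1_1: list[str] = field(default_factory=list)

    def __post_init__(self):
        # Same pattern as the diff: a non-empty v1.1 value overrides contained_in.
        if self.contained_in_v1_1:
            self.contained_in = self.contained_in_v1_1


legacy = FileNode(name="a.csv", contained_in=["archive.zip"])
newer = FileNode(name="b.csv", contained_in_v1_1=["archive.tar"])
both = FileNode(name="c.csv", contained_in=["old.zip"], contained_in_v1_1=["new.zip"])
```

In this sketch an empty `contained_in_v1_1` leaves `contained_in` untouched, while a non-empty one replaces it even if `contained_in` was also provided.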
