-
The distinction between physical representation and logical representation is known under many names, e.g. lexical space vs. value space in XSD. Any name may be confusing without explanation. The current form is ok but it might be better to switch to other names. In this case I'd also change "representation" because "representation of data" is confusing as well. My current best suggestion is to use lexical value and logical value instead of physical representation and logical representation. The current spec also uses "physical contents", this should be changed as well.
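To make the proposed terms concrete, here is a tiny illustration (a sketch in Python, purely for exposition; the field type and value are made up) of what "lexical value" vs. "logical value" would mean for a date field:

```python
from datetime import date

# Lexical value: the text exactly as it appears in the source file.
lexical_value = "2000-01-01"

# Logical value: what a Table Schema `date` field resolves that text to.
logical_value = date.fromisoformat(lexical_value)  # datetime.date(2000, 1, 1)
```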
-
Thanks @nichtich, I agree. I think, currently, confusion might occur because
Although, in general, I guess for the majority of people
-
BTW
This sentence I think is very easy to understand so I guess
-
Hmm, I think a danger of replacing

So I actually prefer the current term

Although, reading through the standards again, I'm also now realizing that's not quite the case, because we're allowing type info to be associated with JSON source data... so it's actually not purely textual/lexical in a strict sense, which complicates things. Does this mean we throw an error or warn if a numeric field finds numeric values as strings (e.g. "0", "1", "2") in JSON source data? What if a string field schema gets numeric values? etc. It'd simplify these cases if all "raw" data was just guaranteed to be parsed by the field schema as pure lexical/textual/string, and field props referencing

In the spirit of brainstorming to get more ideas flowing:

Other possible terms for

Other possible terms for
-
Ok, the issue needs more. The whole section on Concepts needs to be rewritten to better clarify what is meant by "tabular data". Because we also have two levels of description:

There are "raw" tabular data formats (TSV/CSV) and there are tabular data formats with typed values (Excel, SQL, JSON, Parquet... limited to non-nested cells...). I'd say a Table Schema only refers to the former. A SQL Table can be converted to a raw table (just export as CSV) plus a Table Schema (inferred from the SQL Table definition), but SQL Tables are not directly described by Table Schema, nor is any JSON data, as wrongly exemplified in the current specification.
-
Agreed! Perhaps it would clear some of the confusion if we renamed "Table Schema" to "Textual Table Schema" or "Delimited Table Schema" to reflect that the schema definition is specifically designed for textual data. It would also pave the way for future frictionless table schema standards for other types of physical data, e.g. "JSON Table Schema", "Excel Table Schema", "SQL Table Schema", which would be designed around the particularities of the types found in those formats. In that case, we'd have: the physical values of a Textual Table Schema are all strings.

As you say, it's much easier to think about conversions between formats, rather than type coercions, if we try to use a textual table schema to parse an Excel file, for example. The latter has a lot of potential complexity / ambiguity.
-
In
-
The conversation is happening here so I'm adding @pwalsh's comment:
-
First of all, probably I did not understand it correctly, but I never thought about physical and logical in the terms described here - https://www.gooddata.com/blog/physical-vs-logical-data-model/. I was thinking that in the case of Table Schema we're talking about basically a data source (like 1010101 on the disc or so-called text in csv) and a data target (native programming types like in Python and SQL).

So my understanding is that every tabular data resource has a physical data representation (in my understanding of this term). On current computers, it's always just a binary that can be decoded to text in the CSV case or just read "somehow" in case of a non-textual format, e.g. Parquet. For every format there is a corresponding reader that converts that physical representation to a logical representation (e.g. a pandas dataframe from a csv or parquet file).

I think here it's important to note that the Table Schema implementors never deal with any physical data representation (again, based on my understanding of this term). Table Schema doesn't declare rules for csv parsers or parquet readers. In my opinion, Table Schema actually declares only post-processing rules for data that is already in its logical form (read by native readers):

Physical Data -> [ native reader ] -> Source Logical Data -> [ table schema processor ] -> Target Logical Data

For example, for this JSON cell `2000-01-01`:

- physical data -- binary
- source logical data -- string
- target logical data -- date (the point where Table Schema adds its value)

Another note: from an implementor perspective, as said, we only have access to Source Logical Data. It means that the only differentiable parameter for a data value is a source logical data type. For example, a Table Schema implementation can parse the `2000-01-01` string for a date field because it knows an input logical type and a desired logical type. There is no access to the underlying physical representation to have more information about this value. We only see that the input is a string. For example, frictionless-py differentiates all input values into two groups:

- string -> process
- others -> don't process

So for me it feels that Table Schema's level of abstraction is to provide rules for processing "not typed" string values (lexical representation), and that's basically the only thing this spec really can define, while low-level reading can't really be covered. So my point is that

cc @peterdesmet
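A minimal sketch of that pipeline for the `2000-01-01` JSON cell (the function and variable names here are invented for illustration; this is not how any particular implementation is required to work):

```python
import json
from datetime import date

# Physical data: bytes on disk.
physical = b'[["date"], ["2000-01-01"]]'

# Native reader: the JSON parser produces the "source logical" data.
source_logical = json.loads(physical.decode("utf-8"))
cell = source_logical[1][0]  # "2000-01-01" -- still a plain string

# Table Schema processor: a post-processing rule for a `date` field turns the
# already-parsed value into the "target logical" value.
def process_date_cell(value):
    return date.fromisoformat(value) if isinstance(value, str) else value

target_logical = process_date_cell(cell)  # datetime.date(2000, 1, 1)
```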
-
I tend to agree that we actually have 3 states of data in the spec, as you write.

A few notes, though:

1 - You write "Table Schema doesn't declare rules for csv parsers". However, the data package spec does have a csv dialect section and a character encoding setting, which are precisely rules for csv parsers that interact with the physical layer (see the sketch after this comment).

2 - 'source logical data' and 'target logical data' are not great names imo, as they impose some sort of order between the layers (source and target) which does not apply in many cases (e.g. when writing a data package).

So, I would suggest following your lead and using:

- "physical layer" for the lower-level binary data,
- "native format layer" for the data that the native, file-format-specific drivers work with,
- and "logical layer" for the Table Schema typed data.
-
Hi @nichtich,

Are you interested in working on the updated version of #17 that incorporates comments from this issue? After working closely with the specs last month and refreshing in my memory implementation details from

For example, for a JSON data file like this:

```json
[
  ["id", "date"],
  [1, "2012-01-01"]
]
```

We have:
- `1` -- already a logical value
- `"2012-01-01"` -- a string that still needs to be processed into a logical date

I think this tiering is applicable to basically any input data source from

I guess we need to rename the section to something like
-
Yes. I'd like to provide an update, but I don't know when, so it's also ok for me if you come up with an update.

To quickly rephrase your words, we have three levels of data processing:

1. the physical layer (binary data)
2. the native format layer (data as read by the format-specific driver)
3. the logical layer (Table Schema typed data)

Table Schema specification defines how to map from level 2 to level 3.
-
I think it's a good wording!
Of course, no hurry at all. Let's just self-assign ourselves to this issue if one of us decides to start working (currently, I also have other issues to deal with first).
-
I agree but I have an observation here -

In @roll's example, it's mentioned that '1' is already a logical value. I would claim that it's still a native value - a JSON number with the value of 1. It might represent a table schema value of type integer, number, year, or even boolean (with trueValues=[1]). It might also be converted to None, e.g. in case missingValues=[1].

Therefore I would say that the distinction between native and logical is correct and that all values start out as native values and get processed, casted and validated into logical values - even if they come from a more developed file format such as JSON. Then, in each case where we require a value to be present in the descriptor (e.g. in a max constraint, a boolean's trueValues or missingValues), we need to specify whether a native value or a logical value is expected there.
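To illustrate the point, the same native JSON value 1 could plausibly be claimed by several different field definitions (the descriptors below are hypothetical; whether non-string trueValues/missingValues are even allowed is exactly what is being debated):

```python
# One and the same native JSON cell value...
native_value = 1

# ...could be targeted by any of these (hypothetical) field descriptors:
candidate_fields = [
    {"name": "a", "type": "integer"},                       # logical 1
    {"name": "b", "type": "number"},                        # logical 1.0
    {"name": "c", "type": "year"},                          # logical year 1
    {"name": "d", "type": "boolean", "trueValues": [1]},    # logical True
    {"name": "e", "type": "integer", "missingValues": [1]}, # logical None
]
```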
-
Currently, it cannot because
I guess (2) might be cleaner and easier to explain. In this case it will be something like this e.g. for
-
Good to introduce "native" as description of values before the logical level. A native boolean
All native values either have type that directly maps to a logical type (e.g. JSON Boolean and SQL BOOL both map to logical boolean value) or they are treated as strings.
Yes except replace "is represented lexically" with "is represented as string". If the native-data level already has a type compatible with datetype, no lexcial representation is involved at all. I think we are all on the same track but use slightly different terminology for the same idea. |
-
I like the direction :)
-
If we lean towards 3 distinct layers (physical/native/logical), as an implementor I'm curious what the behaviour will be for this resource, for example:

```yaml
data:
  - [id]
  - [1]
  - [2]
  - [3]
schema:
  fields:
    - name: id
      type: string
```

Will it be considered valid data, and will values be coerced to strings? Currently,

Also, I think it's important to check what dataframe parsers (readr/pandas/polars/etc) do in this case so we don't end up with a non-implementable solution.
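For comparison, here is a sketch of two behaviours an implementation could choose for a `string` field that receives native (already typed) values; neither is currently mandated by the spec, and the function names are invented:

```python
def cell_to_string_strict(value):
    # Strict reading: only native strings are acceptable for a string field.
    if not isinstance(value, str):
        raise TypeError(f"expected a native string, got {type(value).__name__}")
    return value

def cell_to_string_coercing(value):
    # Lenient reading: coerce any native value to its textual form.
    return value if isinstance(value, str) else str(value)

cell_to_string_coercing(1)  # "1"
cell_to_string_strict(1)    # raises TypeError
```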
-
I like where this is going too, it's really clarifying the decision at hand:

a) do we parse fields with a 2-layer physical / logical distinction, or
b) do we parse fields with a 3-layer physical / native / logical distinction?

The spec is currently written / defined as (a), a 2-layer scheme. This is why

Supporting JSON in the

An advantage of the 3-layer distinction is that, in addition to JSON, it allows us to consider other intermediate typed sources (like SQL, ODF, etc.), rather than being forced to convert all of the native types to

The disadvantage of the 3-layer distinction is that I think it opens a can of worms of complexity. With 2 layers, we only have to define our

Furthermore, with 3 layers we also need a way to losslessly represent native values in the TableSchema. For JSON types, this is easy, because the spec is JSON. But if we're envisioning support for other native types, we'd need ways to represent their native values in JSON. As @akariv said:
This is also apparent in the issue @roll describes re: numeric data. A JSON

In addition to the example @roll provided above, 3-layer parsing also creates ambiguity in situations like:

```yaml
data:
  - ["id"]
  - ["1"]
  - ["2"]
  - ["3"]
schema:
  fields:
    - name: id
      type: integer
```

```yaml
data:
  - ["id"]
  - [0]
  - [true]
schema:
  fields:
    - name: id
      type: integer
```

```yaml
data:
  - ["id"]
  - ["1"]
  - [0]
  - [true]
  - ["true"]
schema:
  fields:
    - name: id
      type: boolean
```

```yaml
data:
  - ["id"]
  - ["0"]
  - ["1"]
  - [0]
  - [1]
schema:
  fields:
    - name: id
      type: boolean
      trueValues: ["1"]
      falseValues: [0]
```

If we have 2-layer parsing, that is, where all JSON native cell values are received by the TableSchema parser as
(I understand that our current implementation may slightly differ right now because it currently conflates the two- and three-layer parsing approaches.)

By contrast, 3-layer parsing creates a lot of questions:
3-layer parsing also creates problems for schema-sharing:

data.csv:

csvResource:

```json
{
  "name": "csvResource",
  "format": "csv",
  "path": "data.csv",
  "schema": "schema.json"
}
```

jsonResource:

```json
{
  "name": "jsonResource",
  "format": "json",
  "data": [["a"], [true], [false], [true]],
  "schema": "schema.json"
}
```

schema.json:

```json
{
  "fields": [
    {
      "name": "myField",
      "type": "boolean",
      "trueValues": ["true"],
      "falseValues": ["false"]
    }
  ]
}
```

With 2-layer parsing, this isn't a problem; the JSON and CSV files are interpreted exactly the same (as textual values). With 3-layer parsing, however, this may fail because the native values

...And this is just for JSON… we'd have to go through the same exercise for 3-layer parsing of SQL types, ODF types, etc., and for those it'd be further complicated by a need to losslessly express their native values as JSON types. Much easier to stick to the original 2-layer scope of frictionless being for textual tabular data, where by definition

I like the idea of 3-layer parsing, but I think to support native types properly in the spec, TableSchema would have to be rebuilt from the ground up with support for lossless representations of native values, or we'd need to create additional versions of TableSchema that map the subtleties of different native values of a specific format to frictionless fields, e.g.
-
Note that it's not only about JSON;
-
I think it will be simple and correct to say that regarding the data model, Table Schema is no more than an extension of a native data format (all of them). This concept is quite simple: for example, we have JSON and there is SUPERJSON (https://github.com/blitz-js/superjson) that adds support for date/time, regexp, etc. It's achieved via an additional layer of serialization and deserialization for lexical values. If we think about Table Schema that way, then it's still the (1) data model and missing/false/true values need to stay strings only.

PS.
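A rough sketch of that "extension layer" idea (the function names are invented; the point is only that the extra typing lives in a serialize/deserialize step on top of the native format):

```python
import json
from datetime import date

def deserialize(lexical, field_type):
    # Hypothetical extra layer: turn a lexical (string) value into a richer type.
    return date.fromisoformat(lexical) if field_type == "date" else lexical

def serialize(logical):
    # ...and back to a lexical value when writing data out.
    return logical.isoformat() if isinstance(logical, date) else logical

row = json.loads('["2000-01-01"]')  # native JSON layer: a plain string
logical_row = [deserialize(cell, "date") for cell in row]
assert json.dumps([serialize(cell) for cell in logical_row]) == '["2000-01-01"]'
```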
-
That might be confusing though. E.g. a JSON file with -1 denoting an empty value, we would say missingValues="-1". That's reasonable.

But what if 'n/a' is the empty value? Would we say missingValues="n/a" or "\"n/a\"" (as is the physical representation of the value)? What if there is no natural string representation of the value (if the file format is not text based)?
-
I'm getting to thinking that we actually need to isolate Table Schema from any physical data representation and let it operate only on the logical level. On the logical level it's
-
It's 3 layers, but we only have to think about two levels:

We should aim to be able to represent common data types in the type system of Table Schema, but we don't have to ensure lossless mappings of native type systems. We define a set of data types such as string, number types, boolean, n/a... and either the types of native format X directly map to one of these Table Schema types or implementations must downgrade their values, e.g. by serialization to string type values.

P.S.: Maybe this table of common native scalar data types helps to find out what is needed (also for #867).
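A non-normative sketch of that rule, using Python/JSON scalar types as a stand-in native type system (the mapping itself is illustrative):

```python
# Native types that map directly onto Table Schema logical types...
NATIVE_TO_SCHEMA = {bool: "boolean", int: "integer", float: "number", str: "string"}

def schema_type_for(native_value):
    # ...and the "downgrade" fallback: anything else is serialized as a string.
    return NATIVE_TO_SCHEMA.get(type(native_value), "string")

schema_type_for(True)    # "boolean"
schema_type_for(3.14)    # "number"
schema_type_for([1, 2])  # "string" (implementation would serialize the value)
```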
-
I bootstrapped a new specification called "Terminology" - https://datapackage.org/specifications/glossary/ - I think it will be great to define everything we need there and then refer to it across the specs. Lately I encountered that e.g.
-
I agree. It's always technically (at least) 3 layers, in that the source format needs to be parsed to get at the value cells. What I'm trying to get at is how we define the type signature of our field parsers. Right now the spec defines field / schema parsers as mappings from

If we promote this to

I think I agree. As a textual format, the TableSchema should be defined (as it currently is) in terms of always parsing serialized

This way we keep and can avoid

This is another good approach worth exploring. The challenge will be to keep it backwards compatible...
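To make the two candidate type signatures explicit (the type hints and function names below are illustrative, not part of the spec):

```python
from datetime import date
from typing import Any

def parse_date_2layer(cell: str) -> date:
    # 2-layer reading: a field parser only ever receives serialized strings.
    return date.fromisoformat(cell)

def parse_date_3layer(cell: Any) -> date:
    # 3-layer reading: the parser may also receive already-typed native values.
    return cell if isinstance(cell, date) else date.fromisoformat(cell)
```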
-
Dear all,

Here is a pull request based on @akariv's data model - #49

I think this simple 3-layered model highly improves the quality of the building blocks on which Data Package stands and simplifies field types a lot conceptually. Initially, I was more in favour of thinking about Table Schema as a string processor (serialize/deserialize) but having

An interesting fact is that after the separation of the native representation sections for field types, we can realize that field types basically don't have any description on a logical level - something to improve in the future, I guess, as currently we mostly define only serialization/deserialization rules.

Please take a look!
-
Great work @roll! I reviewed the PR and left a few minor comments.
-
Overview
This paragraph - https://datapackage.org/specifications/table-schema/#physical-and-logical-representation

I think the "physical" term might be confusing (see #621) as it seems to be really meaning "lexical" or "textual", while "logical" sounds easy to understand in my opinion, while it might still need to be brainstormed.

Subissues:

- true/falseValues? #621