discuss & define data format #1

Merged
merged 9 commits into v1 on Dec 23, 2020

Conversation

@derhuerst (Member)

No description provided.

@derhuerst (Member Author)

My 2 cents about the format that we specify an API endpoint with:

As the parameters of different backends (e.g. OpenTripPlanner, Navitia, HAFAS, EFA) are quite specific, let's keep the specified JSON fields specific to the backend type (i.e. different fields for OTP than for HAFAS) and define them only roughly. I'm of course fine with general properties of the API, such as a description of the data contained or the provider, being specified in a consistent way.

Personally, I like the format used by kpublictransport a lot, but I propose to

  • rename the filter field (coarse bounding box of the covered area), possibly even split it into "largest area this API is known to return any data for" and "area this API is known to return canonical/detailed/exact data for".
  • change lineModeMap not to specify the semantics of the modes of transport used by the API, but rather just descriptions & localisations. Attempting to standardise modes of transport in a semantic way is very hard, and many others have tried and failed before. (A sketch follows below.)
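For the second point, a minimal sketch of a purely descriptive lineModeMap, carrying only display strings and no standardised semantics (the keys and labels are taken from the de_db.json excerpt that appears later in this PR):

```
"lineModeMap": {
  "64": "Ferry",
  "8": "Local Train (RE/RB)"
}
```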

In general, I'd like us to rapidly iterate on this format. If something doesn't fit, let's open a PR to change it and make a new version!

@derhuerst (Member Author)

derhuerst commented Dec 12, 2020

Currently, at least in the German & European area, we have several open source projects that already specify API endpoints in a somewhat generalised way:

Similar resources:

(edited to include the projects mentioned in #1 (comment))

@vkrause (Member)

vkrause commented Dec 12, 2020

Regarding the KPublicTransport format:

  • it encodes the ISO 3166-1/2 region codes in the file name; it might be worth listing them explicitly in the JSON file (for one, this is somewhat specific to our implementation, and more importantly, it doesn't work for international/interregional providers). We use this information to group endpoints in a sensible way in the UI (Öffi/Transportr use a similar approach IIRC).
  • the "KPlugin" wrapping for name/description is an artifact of our translation system, that is probably also not something that should leak into a shared format
  • anything under "options" is backend-specific, everything else is generic - keeping such a separation probably makes sense
  • the filter property naming is indeed not precise enough for what it does
  • lineModeMap: that is indeed somewhat specific to our implementation; only defining the mapping to Hafas line types here would be more generic. We can then still map that to our own types separately (PTE also does such a mapping IIRC).
  • "type" is a bit of a bad example in the de_db.json one you linked, as we special-case DB. See e.g. for something slightly more "normal": https://invent.kde.org/libraries/kpublictransport/-/blob/master/src/lib/networks/de_be_bvg.json - that would contain which Hafas/Efa/Otp/Navitia/etc variant this endpoint actually uses. There is potentially a bit of a blurry line between what is a different type, and what is just parameters for the same type, from what I have seen all our implementations more or less agree on this though (ie. things that use the same general interface and thus more or less the same client code are the same type).
  • the location identifier entries are also very specific to our implementation, although I could imagine @derhuerst having similar things for matching data between different backends.

@derf (Contributor)

derf commented Dec 12, 2020

For geocoordinates, I propose:

  • reliableArea: [[lon, lat], [lon, lat], ...] is the polygon in which the service returns data with the maximum known amount of detail and accuracy. It should be set for each entry.
  • usableArea: [[lon, lat], [lon, lat], ...] is the polygon in which the service returns any kind of useful data. In case of HAFAS: for locations contained in usableArea but not in reliableArea, data such as line numbers or train attributes may be missing, but core functionality (e.g. routing with real-time data) remains available. usableArea is optional; if unset, it is assumed to be identical to reliableArea. (See the sketch below.)
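A minimal sketch of the two proposed fields; the polygon coordinates are made-up placeholders (a smaller inner polygon and a larger outer one), and all other endpoint fields are omitted:

```
{
  "reliableArea": [[6.8, 51.1], [7.7, 51.1], [7.7, 51.6], [6.8, 51.6]],
  "usableArea": [[5.9, 47.3], [15.0, 47.3], [15.0, 55.1], [5.9, 55.1]]
}
```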

@vkrause (Member)

vkrause commented Dec 12, 2020

Additional projects with similar setups would be:

@derhuerst (Member Author)

  • reliableArea [...] with the maximum known amount of detail and accuracy
  • usableArea [...] any kind of useful data. [...] such as line numbers or train attributes may be missing, but core functionality (e.g. routing with real-time data) remains available.

Your definition of "usable" is what I'd consider to be "reliable". 😬 The phrasing aside, I'd say there are several nuances/levels of data coverage:

  1. Incomplete and/or shallow data about areas outside of their operating area, e.g. long-distance trains & buses but not local modes of transport, or planned data but not realtime data. You could say that this is the extent to which, in a region, the API provides any data.
  2. Reasonably complete data, but realtime data from other operators is missing or inaccurate quite often, e.g. DB & SNCF
  3. Data about their own vehicles, with a high level of detail and the most up-to-date realtime data.

Of course, we could make this distinction arbitrarily precise, but that wouldn't help all of these projects.

@vkrause (Member)

vkrause commented Dec 12, 2020

Attribution information would probably also be a good idea for proper Open Data backends, even if those are still rare. Example: https://invent.kde.org/libraries/kpublictransport/-/blob/master/src/lib/networks/no_entur.json#L38

@derf (Contributor)

derf commented Dec 12, 2020

There are backends with more than one endpoint. For instance, most XML EFA backends provide both XSLT_DM_REQUEST (departure monitor) and XSLT_TRIP_REQUEST2 (routing). Similarly, HAFAS installations don't just have mgate.exe (with "crypto"), but also less capable, easier-to-use endpoints such as ajax-getstop.exe, trainsearch.exe or stboard.exe/bhftafel.exe.

As different endpoints have different requirements and configuration variables, we shouldn't just have one JSON file per endpoint, but also one type definition. E.g. efa_dmrequest, efa_triprequest, hafas_mgate and hafas_stationboard.
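A rough sketch of what one JSON file per endpoint variant could look like; the file names, URLs and the "endpoint" key are hypothetical, only the type names are taken from the comment above:

```
// de_example_efa_dm.json (hypothetical file name)
{
  "type": "efa_dmrequest",
  "endpoint": "https://efa.example.org/XSLT_DM_REQUEST"
}

// de_example_efa_trip.json (hypothetical file name)
{
  "type": "efa_triprequest",
  "endpoint": "https://efa.example.org/XSLT_TRIP_REQUEST2"
}
```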

@derhuerst (Member Author)

Attribution information would probably also be a good idea for proper Open Data [...].

Do you think it makes sense to use the datapackage.json spec (or just the field names) or some linked open data vocabulary for that? It is somewhat specific to files/blobs of data (vs. API endpoints), but we wouldn't add yet another ad-hoc standard to the ecosystem.
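A sketch of what borrowing just the field names could look like: the "attribution" wrapper key is hypothetical, "licenses" and "sources" (with their "name"/"path"/"title" entries) follow the Data Package field names, and all values are placeholders:

```
{
  "attribution": {
    "licenses": [
      {
        "name": "CC-BY-4.0",
        "path": "https://creativecommons.org/licenses/by/4.0/",
        "title": "Creative Commons Attribution 4.0"
      }
    ],
    "sources": [
      { "title": "Example open data portal", "path": "https://data.example.org" }
    ]
  }
}
```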

@derhuerst (Member Author)

derhuerst commented Dec 12, 2020

As different endpoints have different requirements and configuration variables, we shouldn't just have one JSON file per endpoint, but also one type definition. E.g. efa_dmrequest, efa_triprequest, hafas_mgate and hafas_stationboard.

I'm not sure if such an "enum of types of APIs" will scale well. As an example, if you consider HAFAS endpoints, there are those with "crypto", without "crypto", rest.exe APIs, stboard.exe APIs, ajax-getstop.exe APIs, extxml.exe APIs, query.exe APIs, and probably more that I don't know of.

@vkrause (Member)

vkrause commented Dec 12, 2020

There are backends with more than one endpoint. For instance, most XML EFA backends provide both XSLT_DM_REQUEST (departure monitor) and XSLT_TRIP_REQUEST2 (routing). Similarly, HAFAS installations don't just have mgate.exe (with "crypto"), but also less capable, easier-to-use endpoints such as ajax-getstop.exe, trainsearch.exe or stboard.exe/bhftafel.exe.

As different endpoints have different requirements and configuration variables, we shouldn't just have one JSON file per endpoint, but also one type definition. E.g. efa_dmrequest, efa_triprequest, hafas_mgate and hafas_stationboard.

For Hafas those are indeed the two types we have implemented: mgate.exe, or the (old?) query.exe/ajax-getstop.exe/stboard.exe variant, modeled as different types as they need different requests and different result parsing. We currently have only one endpoint for the latter (i.e. query.exe/ajax-getstop.exe/stboard.exe combined, not each of them individually) - example: https://invent.kde.org/libraries/kpublictransport/-/blob/master/src/lib/networks/ch_sbb.json

For EFA we have 1.5 variants: only a single request path, but two separate parsers depending on whether the result is the full XML or the mobile/compact variant. Our current config files model this as one type, with different parameters. This is also how we implement the small variations in the request parameters. We could also handle that as different types, though; the impact on our implementation would be quite small.

@derf closed this on Dec 12, 2020
@derf reopened this on Dec 12, 2020
The format is based on the discussion in #1 and subject to further changes.
@derf (Contributor)

derf commented Dec 14, 2020

I'm not sure if such an "enum of types of APIs" will scale well. As an example, if you consider HAFAS endpoints, there are those with "crypto", without "crypto", rest.exe APIs, stboard.exe APIs, ajax-getstop.exe APIs, extxml.exe APIs, query.exe APIs, and probably more that I don't know of.

You're right. In fact, when it comes to the HAFAS query variant, some endpoints are mostly useless when viewed in isolation. For example, traininfo.exe is only usable with the trainLink obtained by using trainsearch.exe, so those should belong to the same JSON file.

I think it's time to start tinkering with JSON files (at least for me, having an example endpoint definition in a JSON file works much better than just reading a discussion thread). To this end, I have created two DB HAFAS definitions (one for mgate, one for query) and an EFA (VRR) definition. They're suggestions based on the discussion so far; feel free to edit them as you see fit.

For me, the following open questions remain:

  • how should we perform localization? The kpublictransport definitions look sensible to me, but I don't have experience in that area, so I'll leave that decision to you.
  • I can't think of a sensible distinction between usable/reliable/... areas, which is why I left out the coordinates in the example files. @derhuerst I suggest you just go ahead with the solution you prefer :)
  • Personally, I'd like an endpoint repository to document both sophisticated and simple API variants (e.g. both hafas-mgate and hafas-query). As db-hafas-mgate and db-hafas-query have the same provider, client software should be able to decide by itself whether it prefers the mgate or query API, so we don't need to specify a preference or otherwise indicate that they're identical. What do you think?
  • I'm not familiar with the DB HAFAS mgate endpoint, so I left the "type": "hafas_mgate_deutschebahn" special case nearly as-is. Feel free to change it.

@vkrause (Member)

vkrause commented Dec 15, 2020

For me, the following open questions remain:

* how should we perform localization? The kpublictransport definitions look sensible to me, but I don't have experience in that area, so I'll leave that decision to you.

For KPublicTransport this is connected to KDE's translation infrastructure, so they get translated automatically by just being there. No idea how we best handle that here.

* Personally, I'd like an endpoint repository to document both sophisticated and simple API variants (e.g. both hafas-mgate and hafas-query). As db-hafas-mgate and db-hafas-query have the same provider, client software should be able to decide by itself whether it prefers the mgate or query API, so we don't need to specify a preference or otherwise indicate that they're identical. What do you think?

Agreed. As long as there is a way to detect multiple endpoints for the same provider in client code, I'd indeed let the client code decide on the priority. For single-protocol clients this is simple anyway; multi-protocol clients should get good results by picking the better-implemented or more powerful protocol first.

* I'm not familiar with the DB HAFAS mgate endpoint, so I left the `"type": "hafas_mgate_deutschebahn"` special case nearly as-is. Feel free to change it.

I'd go with "hafas_mgate" here; the "deutschebahn" special case in KPublicTransport is for coach layout support, which is a bit out of scope here, I guess.

@@ -0,0 +1,59 @@
{
"name": "Deutsche Bahn (DB)",
"type": "hafas_mgate_deutschebahn",
@derhuerst (Member Author)

As I said before, I don't think an "enum of types of APIs" will scale. I'd prefer something like "hafasMgate": true, because it can be combined with other flags describing the endpoint.

@vkrause (Member)

Ah, I see what you meant there now. That would work for us too. Making type an array could be an alternative?

@derf (Contributor)

Fair enough. I prefer boolean flags (e.g. "hafasMgate": true) over a "type":["hafas_mgate", ...] array – checking whether a dict key exists is more straightforward than iterating over an array.

@derhuerst (Member Author)

I don't care if it's several boolean flags or an array of flags. Both are a lot more future-proof than a single type enum.
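The two alternatives discussed in this thread, side by side, as a quick sketch (the second flag name is hypothetical, added only to show that flags can be combined):

```
// variant A: boolean flags, combinable with other descriptors of the endpoint
{
  "hafasMgate": true,
  "hafasQuery": false
}

// variant B: "type" as an array
{
  "type": ["hafas_mgate"]
}
```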

"64": "Ferry",
"8": "Local Train (RE/RB)"
},
"locationIdentifierType": "db",
@derhuerst (Member Author)

What is the idea behind this, @vkrause? That the IDs returned by the endpoint are DB-style IBNRs?

@vkrause (Member)

KPublicTransport has multiple id "namespaces" per location. These can be endpoint-specific ones that have no meaning outside (the default), proprietary ones that are shared between two or more endpoints (BVG/VBB are such an example), or standard ones (IBNR, UIC, IFOPT, etc.). This is useful for merging data from different sources (different backends, OSM, Wikidata, etc.).

To support this we have the following settings:

  • locationIdentifierType defines the id namespace. This is optional for proprietary id spaces not used anywhere else.
  • For many Hafas-based endpoints there is the problem that they use an IBNR or UIC code for stations that have one, but a proprietary numeric scheme for everything else. The standardLocationIdentifierType and standardLocationIdentifierCountries options address this; the list of covered UIC country codes is needed to reliably distinguish IBNR/UIC codes from other numeric values.

This is obviously very specific to what KPublicTransport does, not particularly elegant or generic, and for most existing users probably irrelevant. I could imagine something like this to be relevant for your merging work though?
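A sketch of the three settings described above: "locationIdentifierType": "db" appears in the diff excerpt above, while the values of the other two fields are illustrative guesses rather than copies from a real KPublicTransport file (80/81/85 are the UIC country codes for Germany/Austria/Switzerland):

```
{
  "locationIdentifierType": "db",
  "standardLocationIdentifierType": "ibnr",
  "standardLocationIdentifierCountries": [80, 81, 85]
}
```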

@derhuerst (Member Author)

This is obviously very specific to what KPublicTransport does, not particularly elegant or generic, and for most existing users probably irrelevant. I could imagine something like this to be relevant for your merging work though?

Yes, highly relevant for my merging work! In fact, there are several projects (in the European community) that try to cross-reference public transport "things" in some way.

derf added 2 commits on December 16, 2020:

  • This allows for fine-grained endpoint descriptions and should be more flexible than the enum approach.
  • It's a catch-all for trainsearch.exe, query.exe, traininfo.exe, stboard.exe and more, so the endpoint should contain the base path only.
@vkrause (Member)

vkrause commented Dec 17, 2020

The options {} vs. top-level keys split is another implementation detail of KPublicTransport worth reconsidering here: anything in options is protocol-specific, anything top-level is handled by generic infrastructure there. For single-protocol clients that separation is completely arbitrary though, and even for multi-protocol clients that split might be different.
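As a sketch of that split (the keys inside "options" and the URL are hypothetical; only the generic-vs-protocol-specific separation itself is taken from the KPublicTransport files discussed here):

```
{
  "name": "Example provider",
  "hafasMgate": true,
  "options": {
    "endpoint": "https://api.example.org/bin/mgate.exe",
    "version": "1.34"
  }
}
```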

@derhuerst (Member Author)

The options {} vs. top-level keys split is another implementation detail of KPublicTransport worth reconsidering here: anything in options is protocol-specific, anything top-level is handled by generic infrastructure there. For single-protocol clients that separation is completely arbitrary though, and even for multi-protocol clients that split might be different.

Keeping the entries that are unspecified by this spec in a nested object probably makes maintaining the spec easier; having them directly at the root level improves usability. I don't really care much about this, though; I'd rather try out in practice what we have.

@derhuerst marked this pull request as ready for review December 20, 2020 20:19
@derhuerst (Member Author)

LGTM for now!

@derf (Contributor)

derf commented Dec 20, 2020

+1, let's make this v1 and see how it turns out in practice.

I moved the documentation to the main readme file and specified the language codes (I presume we're going to use ISO 639-1), so we should be good to go.
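One possible shape for strings localised via ISO 639-1 codes, purely as an illustration (the exact structure in the readme may differ, and the strings are placeholders):

```
{
  "name": {
    "en": "Example transport association",
    "de": "Beispiel-Verkehrsverbund"
  }
}
```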

@derhuerst (Member Author)

@vkrause Please merge if you think this looks good.

@vkrause (Member)

vkrause commented Dec 23, 2020

Agreed, let's get this in, and continue in smaller/more focused PRs/issues to keep the discussion easier to follow.

@vkrause merged commit 3cb0a99 into v1 on Dec 23, 2020
vkrause pushed a commit that referenced this pull request Dec 23, 2020
The format is based on the discussion in #1 and subject to further changes.
@derhuerst deleted the initial-format branch January 7, 2021 14:05
@derhuerst mentioned this pull request Jan 19, 2021