Description
Existing data struct draft: #3627
Transform rule source files can specify a direction attribute. With direction: "both", a source file A-B.xml defines both the forward (A-B) and the backward (B-A) transliterator. Bidirectional rule files can contain rules that affect only A-B, only B-A, or both A-B and B-A.
How should we create data for bidirectional sources?
1. Compile A-B into a data struct and B-A into a data struct (duplicating the rules that affect both directions)
2. Create a bidirectional data struct that tries to avoid duplicate data
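To make the first option concrete, here is a minimal sketch of splitting one bidirectional rule list into two forward-only structs. All names (RuleDirection, SourceRule, CompiledRule, ForwardOnlyData, compile_bidirectional) are hypothetical and are not taken from the draft in #3627. Rules are reduced to plain string pairs, which is much simpler than real transform rules.

```rust
/// Which direction(s) a source rule applies to (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RuleDirection {
    Forward,  // a > b
    Backward, // a < b
    Both,     // a <> b
}

/// A rule as written in a bidirectional source file (simplified to strings).
#[derive(Clone, Debug, PartialEq, Eq)]
struct SourceRule {
    left: String,
    right: String,
    direction: RuleDirection,
}

/// A rule after compilation: it always reads "from -> to".
#[derive(Clone, Debug, PartialEq, Eq)]
struct CompiledRule {
    from: String,
    to: String,
}

/// A data struct that only knows one direction.
#[derive(Debug, Default, PartialEq, Eq)]
struct ForwardOnlyData {
    rules: Vec<CompiledRule>,
}

/// Compiles one bidirectional source into (A-B, B-A).
/// Rules marked `Both` end up in both outputs, which is the duplication
/// discussed in this issue.
fn compile_bidirectional(source: &[SourceRule]) -> (ForwardOnlyData, ForwardOnlyData) {
    let mut a_to_b = ForwardOnlyData::default();
    let mut b_to_a = ForwardOnlyData::default();
    for rule in source {
        if matches!(rule.direction, RuleDirection::Forward | RuleDirection::Both) {
            a_to_b.rules.push(CompiledRule {
                from: rule.left.clone(),
                to: rule.right.clone(),
            });
        }
        if matches!(rule.direction, RuleDirection::Backward | RuleDirection::Both) {
            // For B-A the sides are swapped, so the struct still reads "forward".
            b_to_a.rules.push(CompiledRule {
                from: rule.right.clone(),
                to: rule.left.clone(),
            });
        }
    }
    (a_to_b, b_to_a)
}
```

With this shape, a user who only ships A-B never carries the rules that apply only to B-A, at the cost of storing every `Both` rule twice when both directions are shipped.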
Discussed with @skius, @sffc, @Manishearth, @eggrobin, @younies:
- @robertbastian - (offline concern) I do not want to duplicate too much data, the VarTable in particular
- @skius - Storing bidirectional rules in a single struct affects locality at runtime, as a lot of rules (the ones from B-A) will just be skipped over when transliterating from A-B.
- @eggrobin - There are also validation issues when storing both directions in a single struct, as constructs on the source side might not be legal on the target side and vice versa.
- @sffc - A future enhancement to address data duplication is to split out the VarTables into a separate data key that the transliterator data structs (otherwise as in option 1) refer to. This would deduplicate the VarTables if they turn out to be a significant size problem.
- @younies - I am concerned about duplication in the individual rules, e.g., storing both a > b and a < b instead of a <> b.
- @sffc - Option 1 actually helps against duplication: users who only want one direction will not carry the rules that affect only the other direction.
- @skius - Switches discussion to the data loading/API side. ICU supports specifying a direction when loading transliterators, i.e., load("A-B", Backwards) loads B-A. If we want to support this, should we just require the user to specify the necessary transliterators at datagen time (e.g., the user must specify the B-A transliterator)?
- @sffc - I don't think our API has to support backwards loading. If users want A-B backwards, they should just load B-A. At datagen time, users should specify explicitly the ID of the transliterators they want, e.g., B-A.
- @Manishearth - (Reversing an ID might not be trivial)
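The VarTable split suggested above could look roughly like the following. This is only a sketch under assumptions: VarTable, TransliteratorData, DataStore, the var_table_key field, and the "A-B/vars" key format are all hypothetical, and plain HashMaps stand in for whatever data provider mechanism would actually be used.

```rust
use std::collections::HashMap;

/// Shared table of variables (hypothetical, simplified to strings).
#[derive(Debug)]
struct VarTable {
    entries: Vec<String>,
}

/// Forward-only transliterator data that refers to its VarTable by key
/// instead of embedding it.
#[derive(Debug)]
struct TransliteratorData {
    rules: Vec<(String, String)>,
    var_table_key: String,
}

/// Stand-in for two separate data keys: one for VarTables, one for rules.
#[derive(Debug, Default)]
struct DataStore {
    var_tables: HashMap<String, VarTable>,
    transliterators: HashMap<String, TransliteratorData>,
}

impl DataStore {
    /// Looks up the VarTable used by the transliterator with the given ID.
    fn var_table_for(&self, id: &str) -> Option<&VarTable> {
        let data = self.transliterators.get(id)?;
        self.var_tables.get(&data.var_table_key)
    }
}
```

Here A-B and B-A would each get their own TransliteratorData, but both would carry the same var_table_key, so the table itself is stored once.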
Proposal: Data structs only store the forward direction, so bidirectional sources get compiled into two data structs (thus any shared data is duplicated). Datagen requires the user to explicitly specify each transliterator they want, including its direction (through ID syntax, not through "forward"/"backward" notation); anything else (except transitive dependencies) does not get included. A future enhancement to avoid VarTable duplication is to separate the VarTable out into its own data key and refer to that from the transliterator data structs. The runtime also only accepts the direction implicitly, through the BCP-47 ID.
LGTM: @Manishearth @younies @eggrobin @sffc @skius
@robertbastian thoughts?