JSONL formatted input #347

@daleroberts

Description

Note that JSON was already requested in #51

I would like to suggest that we implement JSONL input format as I believe it shouldn't be too hard to do.

Mapping XML schema to JSONL schema

The DynaML XML format is already a tree, so it would map 1:1. A station in XML:

  <DnaStation>
    <Name>ALICE</Name>
    <Constraints>CCC</Constraints>
    <Type>LLH</Type>
    <StationCoord>
      <Name>ALICE</Name>
      <XAxis>-23.6701</XAxis>
      <YAxis>133.8855</YAxis>
      <Height>603.35</Height>
    </StationCoord>
    <Description>Alice Springs</Description>
  </DnaStation>

becomes:

  {"DnaStation":{"Name":"ALICE","Constraints":"CCC","Type":"LLH","StationCoord":{"Name":"ALICE","XAxis":"-23.6701","YAxis":"133.8855","Height":"603.35"},"Description":"Alice Springs"}}

Instead of trying to debate what should be the new schema, I suggest that the element and field names stay identical to the XML format. This means no new schema to learn and no mapping ambiguity.
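A minimal sketch of that 1:1 mapping, using Python's xml.etree.ElementTree. The converter itself and the choice to keep every leaf value as a string are assumptions for illustration; repeated sibling tags (e.g. multiple Directions) would need list handling on top of this:

```python
# Sketch: convert a DynaML <DnaStation> element to one JSONL record,
# keeping element names identical to the XML schema.
import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Recursively map an XML element to a dict with the same names."""
    if len(elem) == 0:          # leaf element: keep its text as a string
        return elem.text
    return {child.tag: element_to_dict(child) for child in elem}

xml_station = """<DnaStation>
  <Name>ALICE</Name>
  <Constraints>CCC</Constraints>
  <Type>LLH</Type>
  <StationCoord>
    <Name>ALICE</Name>
    <XAxis>-23.6701</XAxis>
    <YAxis>133.8855</YAxis>
    <Height>603.35</Height>
  </StationCoord>
  <Description>Alice Springs</Description>
</DnaStation>"""

root = ET.fromstring(xml_station)
line = json.dumps({root.tag: element_to_dict(root)})
print(line)  # one JSONL record, ready to append to a .jsonl file
```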

Business logic / data tests

Keeping the same schema will allow us to refactor the code easily and apply the same data validation rules to both the XML and JSON input formats. E.g.,

  • There must be a positive number of directions
  • Type must not be blank
  • Standard deviations must be valid

and so forth.
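In outline, the shared-rules idea could look like this. The function name and the specific checks are illustrative only, not taken from the existing codebase; the point is that both the XML and JSONL readers hand the validator the same dict shape:

```python
# Sketch: one validation rule set serving both input formats, operating
# on the common dict representation of a station record.
def validate_station(station):
    errors = []
    if not station.get("Type"):                 # "You can't have blank Type"
        errors.append("blank Type")
    coord = station.get("StationCoord", {})
    for axis in ("XAxis", "YAxis", "Height"):   # coordinates must be present
        if axis not in coord:
            errors.append(f"missing {axis}")
    return errors

ok = {"Type": "LLH",
      "StationCoord": {"XAxis": "-23.6701", "YAxis": "133.8855", "Height": "603.35"}}
bad = {"Type": "", "StationCoord": {}}
print(validate_station(ok))   # no errors expected
print(validate_station(bad))  # blank Type plus missing coordinates
```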

JSONL: Line-oriented JSON

Having worked with GeoJSON and GeoJSONL (its line-oriented variant) on previous projects, I would like to suggest that we use JSONL (line-oriented JSON) instead of standard JSON (see #51). This means we would have one line per station or measurement.

Why is JSONL better than JSON?

Although JSON is clean and widely used, and rich tooling exists (jq, Python's json module, etc.), there are some disadvantages:

  • It is still a single document (you can't simply concatenate: cat vic.json nsw.json > vic+nsw.json produces invalid JSON)
  • You must parse the whole file into memory to work with it
  • No comments are allowed, so we would lose survey metadata comments

By using JSONL instead, we would get:

  • Concatenation is easy: cat survey1.jsonl survey2.jsonl > combined.jsonl works!
  • On concatenation there are no headers to strip, no closing tags, which makes batch workflows easier
  • Line-oriented streaming/reading: read a line, parse it, discard it. Memory usage stays flat regardless of file size. Same performance profile as the current (XML) SAX parser but with far simpler code
  • jq handles it natively: jq '.DnaStation.Name' < stations.jsonl works on JSONL without flags. jq -s '.' survey.jsonl slurps it into an array if you need that
  • grep works: grep '"Type":"G"' measurements.jsonl finds all GPS baselines. You can't do that with XML (tags span lines unpredictably) or with a single JSON document
  • wc -l gives you the record count (minus any header line)
  • head -n 100 gives you the first 100 lines (99 records if the first line is a header). tail, split, shuf all work
  • Easy to generate: no need to track document state, just print one JSON object per line
  • python -c "import json, sys; records = [json.loads(line) for line in sys.stdin]" is all you need to parse it in Python.
  • Parallel processing: split file by line count, process chunks independently
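The flat-memory streaming point above can be sketched as a short generator. The read_records name is a hypothetical helper, and the blank-line skip is an assumed convenience, not a JSONL requirement:

```python
# Sketch: streaming JSONL reader. One json.loads per line means memory
# use is independent of file size -- the "read a line, parse it,
# discard it" pattern described above.
import json

def read_records(path):
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:                 # tolerate blank lines
                yield json.loads(line)

# usage: for rec in read_records("stations.jsonl"): ...
```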

Some cons:

  • No comments either (same as JSON). Survey metadata would need to go in a header object or a separate metadata line.
  • Less human-readable than pretty-printed JSON/XML for single records (everything is on one line), but jq '.' < survey.jsonl pretty-prints each record when you need that
  • Not as universally known as JSON or XML; some people haven't seen the format before
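The "separate metadata line" workaround from the first con might look like this. The reserved "Metadata" key on the first line is an assumed convention for illustration, not part of DynaML:

```python
# Sketch: carrying survey metadata (lost XML comments) as a reserved
# first record, followed by ordinary station records.
import json

lines = [
    json.dumps({"Metadata": {"comment": "hypothetical survey note"}}),
    json.dumps({"DnaStation": {"Name": "ALICE", "Type": "LLH"}}),
]
records = [json.loads(l) for l in lines]
meta = records[0].get("Metadata")                      # header record
stations = [r for r in records if "DnaStation" in r]   # data records
```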

Interesting use cases

Extract all GPS baselines from measurements and import

grep '"Type":"G"' measurements.jsonl > gps.jsonl
dnaimport gps.jsonl

Dependencies

We could use:

  • simdjson (super fast!)
  • nlohmann/json (fast)

Considerations

We could probably do:

  • JSON, and
  • JSONL

Metadata

Labels: New feature (Request a new feature or function)