JSON parser doesn't parse all input streams if a stream contains multiple InputStreams. #7

Open
@hiroyuki-sato

Overview.

The following configuration fetches JSON data from HTTP responses.
It is expected to output nine entries, but the current parser outputs only three.

in:
  type: http
  url: http://express.heartrails.com/api/json
  params:
    - {name: method, value: getStations}
    - {name: x, value: 135.0}
    - {name: y, value: "35"}
  cursor:
    request_parameter_cursor_name: name
    response_parameter_cursor_json_path: '$.response.station[0].next'
  parser:
    type: json
    root: '/response/station'
    flatten_json_array: true

out: {type: stdout}

Environment

  • Embulk: v0.11.0
  • embulk-input-http: forked version (commit 16a4adf94e60caddc0ce590e16332a7111043a4b)
  • embulk-parser-json: 0.4.0

The reason.

The results are spread across multiple HTTP responses. Each response is an independent JSON object, much like a separate file.
However, a single FileInputInputStream is constructed that contains multiple InputStreams.

embulk-parser-json parses only the first InputStream. As a result, it outputs only three entries.

This is the same issue as in embulk-parser-jsonpath:

Most plugins' TransactionalFileInput has only one file (input stream), but the Embulk specification also supports multiple files (input streams).
In the latter case, the current implementation reads only the first file (input stream).
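
The difference between the two behaviours can be sketched outside Embulk. The Python snippet below is only an illustration and not the plugin's code: `MultiFileInput` and its `next_file()` method are hypothetical stand-ins for an input that holds several streams, and the two JSON bodies are shortened versions of the responses shown in this issue.

```python
import io
import json


class MultiFileInput:
    """Hypothetical stand-in for an input that holds several streams."""

    def __init__(self, bodies):
        self._streams = [io.BytesIO(b.encode("utf-8")) for b in bodies]
        self._index = -1

    def next_file(self):
        """Advance to the next stream; return False when none are left."""
        self._index += 1
        return self._index < len(self._streams)

    def read(self):
        """Read the whole content of the current stream."""
        return self._streams[self._index].read()


def stations(raw):
    # Rough analogue of root: '/response/station' with flatten_json_array: true
    return json.loads(raw)["response"]["station"]


def parse_first_only(file_input):
    # Behaviour described in this issue: only the first stream is consumed.
    if not file_input.next_file():
        return []
    return stations(file_input.read())


def parse_all(file_input):
    # Expected behaviour: keep advancing until no stream is left.
    records = []
    while file_input.next_file():
        records.extend(stations(file_input.read()))
    return records


# Shortened response bodies (most fields omitted for brevity).
BODIES = [
    '{"response":{"station":[{"name":"日本へそ公園"},{"name":"比延"},{"name":"黒田庄"}]}}',
    '{"response":{"station":[{"name":"黒田庄"}]}}',
]

print(len(parse_first_only(MultiFileInput(BODIES))))  # 3
print(len(parse_all(MultiFileInput(BODIES))))         # 4
```

With only two of the responses included, reading the first stream yields three records, while looping over every stream yields four.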

Execution results.

embulk-input-http invoked the GET request six times.

2023-08-24 09:29:13.973 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35"
2023-08-24 09:29:15.686 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E9%BB%92%E7%94%B0%E5%BA%84"
2023-08-24 09:29:15.754 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E6%9C%AC%E9%BB%92%E7%94%B0"
2023-08-24 09:29:15.799 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E8%88%B9%E7%94%BA%E5%8F%A3"
2023-08-24 09:29:15.840 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E4%B9%85%E4%B8%8B%E6%9D%91"
2023-08-24 09:29:15.986 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E8%B0%B7%E5%B7%9D"
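
Every request after the first carries a percent-encoded `name` parameter. Decoding it makes the cursor visible: for the two responses shown in the curl output in this issue, the value equals the `next` field of the first station in the previous response. A small Python sketch, where the helper name `cursor_value` is made up for illustration:

```python
from urllib.parse import parse_qs, urlsplit

BASE = "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35"

# The first three URLs from the log above.
URLS = [
    BASE,
    BASE + "&name=%E9%BB%92%E7%94%B0%E5%BA%84",
    BASE + "&name=%E6%9C%AC%E9%BB%92%E7%94%B0",
]


def cursor_value(url):
    """Return the decoded `name` query parameter, or None if it is absent."""
    values = parse_qs(urlsplit(url).query).get("name")
    return values[0] if values else None


for url in URLS:
    print(cursor_value(url))
# None
# 黒田庄
# 本黒田
```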

Simulate with the curl command.

% curl -Lv 'http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35'
*   Trying 35.75.165.181:80...
* Connected to express.heartrails.com (35.75.165.181) port 80 (#0)
> GET /api/json?method=getStations&x=135.0&y=35 HTTP/1.1
> Host: express.heartrails.com
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 24 Aug 2023 00:38:05 GMT
< Content-Type: application/json; charset=utf-8
< Content-Length: 572
< Connection: keep-alive
< Server: nginx
< Expires: Thu, 01 Dec 1994 16:00:00 GMT
< Pragma: no-cache
< X-Runtime: 1
< ETag: "952bd603b2f475e0e56ae31927adb679"
< Cache-Control: private, max-age=0, must-revalidate
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: GET, OPTIONS
< Access-Control-Allow-Headers: *
<
{"response":{"station":[{"name":"日本へそ公園","prefecture":"兵庫県","line":"JR加古川線","x":134.997633,"y":35.002069,"postal":"6770039","distance":"320m","prev":"比延","next":"黒田庄"},{"name":"比延","prefecture":"兵庫県","line":"JR加古川線","x":134.995733,"y":34.988773,"postal":"6770033","distance":"1310m","prev":"新西脇","next":"日本へそ公園"},{"name":"黒田庄","prefecture":"兵庫県","line":"JR加古川線","x":134.992522,"y":35.022689,"postal":"6790313","distance":"2620m","prev":"日本へそ公園","next":"本黒田"}]}}
* Connection #0 to host express.heartrails.com left intact
{
  "response": {
    "station": [
      {
        "name": "日本へそ公園",
        "prefecture": "兵庫県",
        "line": "JR加古川線",
        "x": 134.997633,
        "y": 35.002069,
        "postal": "6770039",
        "distance": "320m",
        "prev": "比延",
        "next": "黒田庄"
      },
      {
        "name": "比延",
        "prefecture": "兵庫県",
        "line": "JR加古川線",
        "x": 134.995733,
        "y": 34.988773,
        "postal": "6770033",
        "distance": "1310m",
        "prev": "新西脇",
        "next": "日本へそ公園"
      },
      {
        "name": "黒田庄",
        "prefecture": "兵庫県",
        "line": "JR加古川線",
        "x": 134.992522,
        "y": 35.022689,
        "postal": "6790313",
        "distance": "2620m",
        "prev": "日本へそ公園",
        "next": "本黒田"
      }
    ]
  }
}
% curl -Lv 'http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E9%BB%92%E7%94%B0%E5%BA%84'
*   Trying 35.75.165.181:80...
* Connected to express.heartrails.com (35.75.165.181) port 80 (#0)
> GET /api/json?method=getStations&x=135.0&y=35&name=%E9%BB%92%E7%94%B0%E5%BA%84 HTTP/1.1
> Host: express.heartrails.com
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 24 Aug 2023 00:39:18 GMT
< Content-Type: application/json; charset=utf-8
< Content-Length: 192
< Connection: keep-alive
< Server: nginx
< Expires: Thu, 01 Dec 1994 16:00:00 GMT
< Pragma: no-cache
< X-Runtime: 1
< ETag: "3be1f77accfba140aa48670d77eb6e97"
< Cache-Control: private, max-age=0, must-revalidate
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: GET, OPTIONS
< Access-Control-Allow-Headers: *
<
{"response":{"station":[{"name":"黒田庄","prefecture":"兵庫県","line":"JR加古川線","x":134.992522,"y":35.022689,"postal":"6790313","prev":"日本へそ公園","next":"本黒田"}]}}
* Connection #0 to host express.heartrails.com left intact
{
  "response": {
    "station": [
      {
        "name": "黒田庄",
        "prefecture": "兵庫県",
        "line": "JR加古川線",
        "x": 134.992522,
        "y": 35.022689,
        "postal": "6790313",
        "prev": "日本へそ公園",
        "next": "本黒田"
      }
    ]
  }
}

....

Example output reproducing the issue

2023-08-24 18:00:12.915 +0900 [INFO] (main): m2_repo is set as a sub directory of embulk_home: /Users/user/.embulk/lib/m2/repository
2023-08-24 18:00:12.918 +0900 [INFO] (main): gem_home is set as a sub directory of embulk_home: /Users/user/.embulk/lib/gems
2023-08-24 18:00:12.918 +0900 [INFO] (main): gem_path is set empty.
2023-08-24 18:00:12.918 +0900 [DEBUG] (main): Embulk system property "default_guess_plugin" is set to: "gzip,bzip2,json,csv"
2023-08-24 18:00:13.049 +0900 [INFO] (main): Started Embulk v0.11.0
2023-08-24 18:00:14.752 +0900 [INFO] (0001:transaction): Gem's home and path are set by system configs "gem_home": "/Users/user/.embulk/lib/gems", "gem_path": ""
2023-08-24 18:00:15.364 +0900 [INFO] (0001:transaction): Loaded JRuby runtime 9.4.2.0
2023-08-24 18:00:15.395 +0900 [INFO] (0001:transaction): Loaded plugin embulk/input/http from a load path
2023-08-24 18:00:15.487 +0900 [INFO] (0001:transaction): Loaded plugin embulk-output-stdout
2023-08-24 18:00:15.538 +0900 [INFO] (0001:transaction): Loaded plugin embulk-parser-json
2023-08-24 18:00:15.687 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=16 / output tasks 8 = input tasks 1 * 8
2023-08-24 18:00:15.724 +0900 [INFO] (0001:transaction): {done:  0 / 1, running: 0}
2023-08-24 18:00:15.860 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35"
2023-08-24 18:00:15.985 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E9%BB%92%E7%94%B0%E5%BA%84"
2023-08-24 18:00:16.007 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E6%9C%AC%E9%BB%92%E7%94%B0"
2023-08-24 18:00:16.029 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E8%88%B9%E7%94%BA%E5%8F%A3"
2023-08-24 18:00:16.050 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E4%B9%85%E4%B8%8B%E6%9D%91"
2023-08-24 18:00:16.085 +0900 [INFO] (0015:task-0000): GET "http://express.heartrails.com/api/json?method=getStations&x=135.0&y=35&name=%E8%B0%B7%E5%B7%9D"
{"prefecture":"兵庫県","distance":"320m","line":"JR加古川線","next":"黒田庄","prev":"比延","x":134.997633,"y":35.002069,"postal":"6770039","name":"日本へそ公園"}
{"prefecture":"兵庫県","distance":"1310m","line":"JR加古川線","next":"日本へそ公園","prev":"新西脇","x":134.995733,"y":34.988773,"postal":"6770033","name":"比延"}
{"prefecture":"兵庫県","distance":"2620m","line":"JR加古川線","next":"本黒田","prev":"日本へそ公園","x":134.992522,"y":35.022689,"postal":"6790313","name":"黒田庄"}
2023-08-24 18:00:16.162 +0900 [INFO] (0001:transaction): {done:  1 / 1, running: 0}
2023-08-24 18:00:16.167 +0900 [INFO] (main): Committed.
2023-08-24 18:00:16.167 +0900 [INFO] (main): Next config diff: {"in":{},"out":{}}
