Use pysimdjson for parsing wat records #49
silentninja wants to merge 1 commit into commoncrawl:main from
sebastian-nagel
left a comment
Hi @silentninja, thanks for the PR and testing it out.
Because of the observed incompatibilities, it looks like a drop-in use of simdjson isn't recommended.
If we use simdjson.loads(json_blob) or the parse method with recursion (simdjson.Parser().parse(json_blob, True)), the performance gains are mostly lost.
I'm currently running a couple of performance tests and will report back about them on issue #41.
  if self.is_wat_json_record(record):
      # WAT (response) record
-     record = json.loads(self.get_payload_stream(record).read())
+     record = self.json_extractor.parse(self.get_payload_stream(record).read())
Unfortunately, processing the JSON may raise an exception:
  File "/mnt/data/wastl/proj/cc/git/cc-pyspark/sparkcc.py", line 377, in iterate_records
    for res in self.process_record(record):
               ~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "/mnt/data/wastl/proj/cc/git/cc-pyspark/server_count.py", line 42, in process_record
    server_names.append(headers[header].strip())
                        ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'csimdjson.Array' object has no attribute 'strip'
Here is a short snippet showing why this happens:
>>> import simdjson
>>> type(simdjson.Parser().parse('[1,2]'))
<class 'csimdjson.Array'>
>>> type(simdjson.Parser().parse('[1,2]', True))
<class 'list'>
Would need to add an extra check:
... or isinstance(headers[header], simdjson.Array)
But then it's no drop-in replacement anymore.
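One way to sidestep the type check altogether might be to test for str first and treat everything else as a sequence of strings. The sketch below is only an illustration: the helper name stripped_header_values and the sample headers dict are hypothetical, and it assumes that iterating a simdjson.Array yields plain Python strings for string elements, which has not been verified here.

```python
def stripped_header_values(value):
    """Yield stripped strings from a header value.

    The value is assumed to be either a single string or an iterable of
    strings (a list, or possibly a simdjson.Array proxy).
    """
    if isinstance(value, str):
        yield value.strip()
    else:
        for item in value:
            yield item.strip()


# Hypothetical sample data, shaped like parsed WAT response headers.
headers = {"Server": ["nginx ", " cloudflare"], "Content-Type": "text/html "}

server_names = []
for name, value in headers.items():
    if name.lower() == "server":
        server_names.extend(stripped_header_values(value))
```

This keeps the calling code independent of which JSON module produced the headers, but every place that touches parsed values would still need the same treatment, so it does not make simdjson a drop-in replacement either.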
  try:
      import simdjson
      self.json = simdjson.Parser()
      self.parse = self.json.parse
Could write:
self.parse = lambda j: self.json.parse(j, True)
to force recursive parsing and avoid incompatibilities.
However, then one of the major performance benefits of the simdjson module fades away.
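A minimal sketch of that idea, written as a standalone function rather than the class attribute used in the PR. The name make_json_parser is hypothetical, and the fallback to the standard json module when pysimdjson is not installed is an assumption about how the optional dependency could be handled.

```python
import json

try:
    import simdjson  # optional dependency (pysimdjson)
except ImportError:
    simdjson = None


def make_json_parser():
    """Return a callable that parses a JSON blob into plain Python objects."""
    if simdjson is None:
        # Fallback: the standard library already returns dict/list/str values.
        return json.loads
    parser = simdjson.Parser()
    # Passing True as the second argument (recursive) converts the whole
    # document to native Python objects instead of csimdjson proxies.
    return lambda blob: parser.parse(blob, True)


parse = make_json_parser()
headers = parse(b'{"Server": ["nginx", "cloudflare"]}')
```

With either backend, callers receive ordinary dict and list objects, so existing isinstance(..., list) checks keep working. The cost is the full conversion on every record, which is the performance trade-off described above.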
  if not l:
      continue
- if url_attr in l:
+ if url_attr is not None and url_attr in l:
Good observation and thanks for testing this. Unfortunately, it's not the only incompatibility.
Fixes #41
Made changes to the examples to use pysimdjson for parsing wat records and avoid causing the error mentioned in TkTech/pysimdjson#122.