Description
It seems that the API is returning malformed JSON, and this is a huge problem for large subreddits: we have to restart the data collection and may never finish, since the script keeps crashing along the way. Would it be possible to add better error handling so that the script keeps going when it hits these malformed responses and records them in the logs?
Here are two examples.
Exception Group Traceback (most recent call last):
| File "C:\Users\admin_local\Desktop\mmm\bsa.py", line 9, in
| result2 = asyncio.run(ppa.get_submissions(subreddit='researchchemicals',
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 194, in run
| return runner.run(main)
| ^^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
| return self._loop.run_until_complete(task)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete
| return future.result()
| ^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\BAScraper_async.py", line 181, in get_submissions
| comments = await self._get_link_ids_comments(submission_ids)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\BAScraper_async.py", line 325, in _get_link_ids_comments
| raise err
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\BAScraper_async.py", line 310, in _get_link_ids_comments
| async with asyncio.TaskGroup() as tg:
| ^^^^^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\asyncio\taskgroups.py", line 145, in aexit
| raise me from None
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\utils.py", line 153, in make_request
| result = await response.json()
| ^^^^^^^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\aiohttp\client_reqrep.py", line 1199, in json
| raise ContentTypeError(
| aiohttp.client_exceptions.ContentTypeError: 502, message='Attempt to decode JSON with unexpected mimetype: text/html', url='https://api.pullpush.io/reddit/search/comment/?link_id=10qlmx7'
|
| During handling of the above exception, another exception occurred:
|
| Traceback (most recent call last):
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\BAScraper_async.py", line 305, in link_id_worker
| res.append(await make_request(self, 'comments', link_id=link_id))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\utils.py", line 198, in make_request
| raise Exception(f'{coro_name} | unexpected error: \n{err}')
| Exception: coro-0 | unexpected error:
502, message='Attempt to decode JSON with unexpected mimetype: text/html', url='https://api.pullpush.io/reddit/search/comment/?link_id=10qlmx7'
Here is another example:
Exception Group Traceback (most recent call last):
| File "C:\Users\admin_local\Desktop\mmm\bsa.py", line 9, in
| result2 = asyncio.run(ppa.get_submissions(subreddit='Drugs',
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 194, in run
| return runner.run(main)
| ^^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\asyncio\runners.py", line 118, in run
| return self._loop.run_until_complete(task)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete
| return future.result()
| ^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\BAScraper_async.py", line 181, in get_submissions
| comments = await self._get_link_ids_comments(submission_ids)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\BAScraper_async.py", line 325, in _get_link_ids_comments
| raise err
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\BAScraper_async.py", line 310, in _get_link_ids_comments
| async with asyncio.TaskGroup() as tg:
| ^^^^^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\asyncio\taskgroups.py", line 145, in aexit
| raise me from None
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\utils.py", line 153, in make_request
| result = await response.json()
| ^^^^^^^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\aiohttp\client_reqrep.py", line 1199, in json
| raise ContentTypeError(
| aiohttp.client_exceptions.ContentTypeError: 502, message='Attempt to decode JSON with unexpected mimetype: text/html', url='https://api.pullpush.io/reddit/search/comment/?link_id=14na6nc'
|
| During handling of the above exception, another exception occurred:
|
| Traceback (most recent call last):
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\BAScraper_async.py", line 305, in link_id_worker
| res.append(await make_request(self, 'comments', link_id=link_id))
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "C:\Users\admin_local\AppData\Local\Programs\Python\Python312\Lib\site-packages\BAScraper\utils.py", line 198, in make_request
| raise Exception(f'{coro_name} | unexpected error: \n{err}')
| Exception: coro-0 | unexpected error:
502, message='Attempt to decode JSON with unexpected mimetype: text/html', url='https://api.pullpush.io/reddit/search/comment/?link_id=14na6nc'
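For reference, here is a minimal sketch of the kind of handling being asked for, assuming aiohttp is used for the requests as the tracebacks suggest. The fetch_json helper and its parameters are illustrative names only, not BAScraper's actual API: it catches the ContentTypeError raised when the API answers with an HTML error page instead of JSON, logs the offending URL, and returns None so the caller can record the skipped link_id and keep going.

# Hypothetical sketch of the requested behaviour, not BAScraper's implementation:
# when the PullPush API answers with an HTML error page (e.g. a 502) instead of
# JSON, log the URL and continue rather than letting the exception escape.
import asyncio
import logging

import aiohttp

logger = logging.getLogger("scraper")


async def fetch_json(session: aiohttp.ClientSession, url: str, **params) -> list | None:
    """Return the decoded JSON payload, or None if the response was not JSON."""
    async with session.get(url, params=params) as response:
        try:
            payload = await response.json()
        except aiohttp.ContentTypeError:
            # 502/503 error pages come back as text/html, which aiohttp refuses to decode.
            logger.warning("non-JSON response (%s) from %s - skipping",
                           response.status, response.url)
            return None
        # PullPush wraps results in a "data" key; adjust if the payload differs.
        return payload.get("data", [])


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        result = await fetch_json(
            session,
            "https://api.pullpush.io/reddit/search/comment/",
            link_id="10qlmx7",
        )
        if result is None:
            # The caller records the skipped link_id so the run can be resumed later.
            logger.info("skipped link_id=10qlmx7")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    asyncio.run(main())

Retrying the request a couple of times before giving up and logging it would be a reasonable refinement, since these 502 responses appear to be transient.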