Description
Was able to get pushshift (as mentioned in the README) working to export old comments. Thought I'd mention it here.
It's possible to use pushshift to get data from further back, but I'm not sure it should be part of this project. Some of the older comments don't have the same JSON structure; I can only assume pushshift pulls from multiple sources the further back it goes. It requires some normalization, like here and here.
The only data I was missing from rexport, due to the 1000 limit on queries, were comments. It exported the last 1000, but I have about 5000 on reddit in total.
Regarding HPI:
Wrote a simple package to request/save that data, with a dal (whose PComment NamedTuple has similar @property attributes to rexport's DAL) and a merge function, and now:
In [15]: from my.reddit import comments, _dal, pushshift_comments
In [16]: len(list(_dal().comments())) # from dal.py in rexport
Out[16]: 999
In [17]: len(list(pushshift_comments())) # from pushshift
Out[17]: 4891
In [18]: len(list(comments())) # merged data, using utc_time to remove duplicates
Out[18]: 4893
In [19]: comments
Out[19]: <function my.reddit.comments() -> Iterator[Union[rexport.dal.Comment, pushshift_comment_export.dal.PComment]]>
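For anyone curious how a merge like this can work, here is a minimal sketch, not the actual code from my.reddit. FakeComment, its utc_time field (epoch seconds), and merge_comments are hypothetical names made up for illustration. It assumes both sources expose a comparable UTC timestamp and that two comments with the same timestamp are the same comment.

```python
from typing import Iterable, Iterator, NamedTuple, Set


class FakeComment(NamedTuple):
    # Hypothetical stand-in for rexport.dal.Comment / PComment
    source: str    # which export the comment came from
    utc_time: int  # creation time as epoch seconds (assumed field name)
    text: str


def merge_comments(*sources: Iterable[FakeComment]) -> Iterator[FakeComment]:
    """Yield comments from each source in order, skipping repeated timestamps."""
    seen: Set[int] = set()
    for source in sources:
        for comment in source:
            if comment.utc_time in seen:
                # Already emitted by an earlier source; treat as a duplicate
                continue
            seen.add(comment.utc_time)
            yield comment
```

Sources passed first win on duplicates, so putting the rexport data first keeps its richer objects wherever the two exports overlap.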
It's possible that one could write enough @property wrappers to handle the differences in the JSON representations of old pushshift data; unsure if that's something you want to pursue here.
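A rough sketch of what such @property wrappers might look like. NormalizedComment is a hypothetical name, not a class from rexport or my package. The specific differences it handles (created_utc stored as a string, a missing permalink key) are assumptions picked for illustration, and the fallback URL shape is likewise an assumption rather than something verified against every old record.

```python
from datetime import datetime, timezone
from typing import Any, Dict, NamedTuple


class NormalizedComment(NamedTuple):
    # Hypothetical wrapper around one raw comment dict
    raw: Dict[str, Any]

    @property
    def created(self) -> datetime:
        # Assumption: some old records store created_utc as a string
        return datetime.fromtimestamp(int(self.raw["created_utc"]), tz=timezone.utc)

    @property
    def text(self) -> str:
        return self.raw["body"]

    @property
    def url(self) -> str:
        permalink = self.raw.get("permalink")
        if permalink is None:
            # Assumption: rebuild a path from the other fields when permalink is absent
            post_id = self.raw["link_id"].split("_", 1)[-1]  # drop the "t3_" prefix
            permalink = f"/r/{self.raw['subreddit']}/comments/{post_id}/_/{self.raw['id']}/"
        return "https://www.reddit.com" + permalink
```

Callers only ever touch created, text and url, so each new quirk in the old data becomes one more branch inside a property instead of a change to the merge code.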