merge rexport, pushshift and gdpr reddit data #89

Open
@purarue

I was able to get pushshift, as mentioned in the README, working to export old comments. Thought I'd mention it here.

It's possible to use pushshift to get data further back, but I'm not sure it should be part of this project. Some of the older comments don't have the same JSON structure; I can only assume pushshift is getting data from multiple sources the further back it goes. It requires some normalization, like here and here.

The only data I was missing because of the 1000-item limit on queries in rexport was comments. It exported the last 1000, but I have about 5000 on reddit in total.

Regarding HPI:

Wrote a simple package to request/save that data, with a DAL (whose PComment NamedTuple has similar @property attributes to rexport's DAL) and a merge function, and now:

In [15]: from my.reddit import comments, _dal, pushshift_comments

In [16]: len(list(_dal().comments())) # from dal.py in rexport
Out[16]: 999

In [17]: len(list(pushshift_comments())) # from pushshift
Out[17]: 4891

In [18]: len(list(comments())) # merged data, using utc_time to remove duplicates
Out[18]: 4893

In [19]: comments
Out[19]: <function my.reddit.comments() -> Iterator[Union[rexport.dal.Comment, pushshift_comment_export.dal.PComment]]>
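For anyone reading along, here is a minimal sketch of what a merge that uses the UTC timestamp to drop duplicates could look like. This is not the actual code in my.reddit; the `Comment` NamedTuple and `merge_comments` are hypothetical stand-ins, and it assumes both DALs expose a timezone-aware `created` attribute.

```python
from datetime import datetime, timezone
from itertools import chain
from typing import Iterable, Iterator, NamedTuple, Set


class Comment(NamedTuple):
    # Hypothetical stand-in for rexport.dal.Comment / PComment;
    # only the attributes needed for merging are modelled here.
    created: datetime
    text: str
    source: str


def merge_comments(*sources: Iterable[Comment]) -> Iterator[Comment]:
    """Yield comments from every source, skipping any whose UTC
    timestamp (in whole seconds) has already been emitted."""
    seen: Set[int] = set()
    for comment in chain(*sources):
        key = int(comment.created.timestamp())
        if key in seen:
            continue  # same timestamp already yielded by an earlier source
        seen.add(key)
        yield comment


def _at(ts: int) -> datetime:
    # Helper for the demo data: build a timezone-aware UTC datetime.
    return datetime.fromtimestamp(ts, tz=timezone.utc)


rexport_side = [Comment(_at(100), "a", "rexport"), Comment(_at(200), "b", "rexport")]
pushshift_side = [Comment(_at(200), "b", "pushshift"), Comment(_at(300), "c", "pushshift")]
merged = list(merge_comments(rexport_side, pushshift_side))
```

Sources passed first win on a collision, so putting rexport first keeps its richer objects where both exports have the comment. Keying on the timestamp alone would also drop two distinct comments posted in the same second, so a real implementation might prefer the comment id if both sources have it.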

It's possible that one could write enough @property wrappers to handle the differences in the JSON representations of old pushshift data; unsure if that's something you want to pursue here.
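To illustrate the @property idea, a rough sketch follows. The specific differences handled here (a `created_utc` that may be an int or a numeric string, and a `permalink` key that may be absent) are assumptions for the sake of the example, not a confirmed list of what old pushshift records look like, and `PCommentSketch` is a hypothetical name rather than the class in the package.

```python
from datetime import datetime, timezone
from typing import Any, Dict, NamedTuple


class PCommentSketch(NamedTuple):
    # Hypothetical wrapper around one raw pushshift JSON object.
    raw: Dict[str, Any]

    @property
    def created(self) -> datetime:
        # Assumption: 'created_utc' may arrive as an int or as a numeric
        # string depending on the record's age; int() accepts both.
        return datetime.fromtimestamp(int(self.raw["created_utc"]), tz=timezone.utc)

    @property
    def text(self) -> str:
        return self.raw["body"]

    @property
    def url(self) -> str:
        # Assumption: some older records lack 'permalink'. The fallback
        # rebuilds a path from 'subreddit', 'link_id' and 'id'; treat the
        # exact URL shape as illustrative, not verified.
        permalink = self.raw.get("permalink")
        if permalink is None:
            post_id = self.raw["link_id"].split("_", 1)[-1]  # strip 't3_' prefix
            permalink = f"/r/{self.raw['subreddit']}/comments/{post_id}/_/{self.raw['id']}/"
        return "https://reddit.com" + permalink


new_style = PCommentSketch({"created_utc": 1500000000, "body": "hi",
                            "permalink": "/r/test/comments/abc/title/def/"})
old_style = PCommentSketch({"created_utc": "1500000000", "body": "hi",
                            "subreddit": "test", "link_id": "t3_abc", "id": "def"})
```

The point is that callers only ever touch `created`, `text` and `url`, so each newly discovered quirk in the old data becomes one more branch inside a property instead of a change to every consumer.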

Labels: patterns (Patterns of working with/writing HPI modules), reddit