Milestones

  • Having a first setup for the project in terms of data aggregation, data processing, and visualization, the next steps have become clearer.

    First, using the experimental HTML from arXiv is not really workable, because it is only available on the main "arxiv.org" and not on "export.arxiv.org", which makes it unusable for data mining. Even when respecting the request rate limit of 1 request per 3 seconds, I cannot prevent ending up on the robot detection list and therefore getting locally blocked from accessing the arXiv data. This implies two things:

    1. switch back to PDF parsing, or even to the sources (tex, bib, etc.);
    2. switch to export.arxiv.org (4 requests per second) or use bulk data access (see the fetching sketch below).

    Secondly, the visualization is still of little use: we only get nodes with arrows, but cannot recover which paper is which. Looking ahead, it makes most sense to switch to a web app, which is platform independent and easily embeddable into a website (see the labeled-graph sketch below).

    Third, Natural Language Processing (NLP) will be explored, and maybe a first indexing method implemented (see the Tsoding streams).
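A minimal sketch of polite, rate-limited metadata fetching via the arXiv API on export.arxiv.org. The query term, batch sizes, and the `fetch_batches` helper are hypothetical; the 3-second delay follows arXiv's guideline of at most 1 request every 3 seconds.

```python
# Minimal sketch: throttled paging through the arXiv API on export.arxiv.org.
# The search term and batch sizes are made up for illustration.
import time
import urllib.parse
import urllib.request

API_URL = "http://export.arxiv.org/api/query?search_query=all:{}&start={}&max_results={}"
DELAY_SECONDS = 3.0  # arXiv asks for at most 1 request every 3 seconds


def fetch_batches(term: str, batches: int, batch_size: int = 100) -> list:
    """Fetch Atom XML result pages for `term`, sleeping between requests."""
    pages = []
    for i in range(batches):
        url = API_URL.format(urllib.parse.quote(term), i * batch_size, batch_size)
        with urllib.request.urlopen(url, timeout=30) as resp:
            pages.append(resp.read())
        time.sleep(DELAY_SECONDS)  # throttle before the next request
    return pages


if __name__ == "__main__":
    for page in fetch_batches("citation graph", batches=2, batch_size=10):
        print(page[:200])  # show the start of each Atom feed page
```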
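To make the rendered graph answer "which paper is which", node labels and hover tooltips can carry the paper metadata. The sketch below assumes pyvis (not necessarily this project's library), whose standalone HTML output would also fit the planned move to a web app; all ids, titles, and edges are toy data.

```python
# Hypothetical sketch: a citation graph whose nodes carry paper metadata,
# exported as a standalone HTML file that can be embedded in a website.
from pyvis.network import Network

# Toy data: arXiv ids mapped to titles, plus one (citing, cited) edge.
papers = {
    "2101.00001": "A Survey of Citation Graphs",
    "2102.00002": "Mining arXiv at Scale",
}
citations = [("2102.00002", "2101.00001")]

net = Network(height="600px", width="100%", directed=True)
for arxiv_id, title in papers.items():
    # `label` is drawn on the node, `title` shows as a hover tooltip
    net.add_node(arxiv_id, label=arxiv_id, title=title)
for citing, cited in citations:
    net.add_edge(citing, cited)

net.save_graph("citation_graph.html")  # writes embeddable HTML
```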

    Overdue by 8 months
    Due by April 25, 2025
    0/3 issues closed