This project is a web interface that attempts to tame the overwhelming flood of papers on arXiv. It lets researchers keep track of recent papers, search for papers, sort papers by similarity to any paper, see recent popular papers, add papers to a personal library, and get personalized recommendations of (new or old) arXiv papers. The code is currently running live at www.arxiv-sanity.com/, where it serves 25,000+ arXiv papers from Machine Learning (cs.[AI|CL|CV|LG|NE|SD]/eess.[AS|IV]/stat.ML) over all years. With this code base you can replicate the website for any subset of arXiv by changing the categories in fetch_papers.py.
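To make the idea of "changing the categories" concrete, here is a minimal sketch of how a category list could be turned into an arXiv API query URL. The function name `build_query_url` and its defaults are hypothetical and are not taken from fetch_papers.py; only the endpoint and the `search_query`, `sortBy`, `start` and `max_results` parameters come from the public arXiv API.

```python
from urllib.parse import urlencode

# Public arXiv API endpoint.
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(categories, start=0, max_results=100):
    """Build an arXiv API URL matching papers in any of the given categories.

    Hypothetical helper for illustration; fetch_papers.py may build its
    query differently.
    """
    # Join the categories into one boolean query, e.g. "cat:cs.CV OR cat:cs.LG".
    search_query = " OR ".join(f"cat:{c}" for c in categories)
    params = {
        "search_query": search_query,
        "sortBy": "lastUpdatedDate",
        "start": start,
        "max_results": max_results,
    }
    # Keep ":" unescaped so the query stays readable; spaces become "+".
    return f"{ARXIV_API}?{urlencode(params, safe=':')}"
```

For example, `build_query_url(["cs.CV", "cs.LG"])` yields a URL restricted to computer vision and machine learning papers; swapping in other category codes would target a different subset.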
- fetch_papers.py queries the arXiv API and creates a file db.p that contains all information for each paper.
- download_pdfs.py iterates over all papers in the parsed pickle and downloads them into the folder pdf.
- thumb_pdf.py exports thumbnail pictures of all downloaded PDFs to thumb.
- analyze.py computes tf-idf vectors from the fetched information and saves them to tfidf.p, tfidf_meta.p and sim_dict.p.
- buildsvm.py trains SVMs for all users (if any) and exports a pickle user_sim.p.
- make_cache.py precomputes data for fast searching from the previous outputs and saves it to db2.p.
- twitter_daemon.py is optional; it uses your Twitter API credentials (stored in twitter.txt) to query Twitter periodically for mentions of papers in the database, and writes the results to the pickle file twitter.p.
- serve.py runs the web server.
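Each data preparation script consumes the files written by the ones listed before it, so a driver can run them one after another and stop as soon as one fails. The following is a rough sketch of that idea, assuming the scripts run in the order listed and take no required arguments. `PIPELINE` and `run_pipeline` are hypothetical names, and this is not the repository's own driver code.

```python
import subprocess
import sys

# Data preparation scripts, assumed to run in the order they are listed.
PIPELINE = [
    "fetch_papers.py",
    "download_pdfs.py",
    "thumb_pdf.py",
    "analyze.py",
    "buildsvm.py",
    "make_cache.py",
]


def run_pipeline(scripts=PIPELINE, runner=subprocess.run):
    """Run each script with the current interpreter; stop at the first failure.

    Returns the list of scripts that completed successfully.
    `runner` is injectable so the function can be exercised without
    actually launching the scripts.
    """
    completed = []
    for script in scripts:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            # Later steps depend on this one's output, so do not continue.
            break
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the repository root would then attempt the six steps in sequence; the optional twitter_daemon.py and the server are deliberately left out because they are long-running.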
You need to install the following software:
- Python 3: all the scripts depend on it
- ImageMagick: converts PDFs to thumbnails
- Ghostscript: ImageMagick needs it for PDF conversion
- MongoDB: stores data from Twitter
- sqlite-tools: stores data about registered users
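Before running anything, it can help to check that these tools are actually on your PATH. Below is a small sketch using the standard library's `shutil.which`. The executable names (`convert`, `gs`, `mongod`, `sqlite3`) are assumptions about a typical installation and may differ on your system (newer ImageMagick releases, for instance, ship a `magick` binary), and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Tool name -> executable assumed to be on PATH (adjust for your system).
REQUIRED_TOOLS = {
    "ImageMagick": "convert",
    "Ghostscript": "gs",
    "MongoDB": "mongod",
    "SQLite": "sqlite3",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools whose executable cannot be found on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]
```

An empty list from `missing_tools()` suggests all four executables were found; any names it returns point at what still needs installing or renaming in `REQUIRED_TOOLS`.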
$ pip install -r requirements.txt
all_in_one.py contains all the data preparation steps mentioned above, so just run it to do the fetching, downloading, analyzing, etc.:
$ python all_in_one.py
Run python serve.py and visit your_ip:5000. You can change the port with the port parameter.
If you'd like to expose the server to the outside world (e.g. on AWS), run it as python serve.py --prod to use Tornado instead of Flask.
You will also want to create a secret_key.txt file and fill it with random text (see the top of serve.py).
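One way to produce that random text is Python's `secrets` module. This is only a sketch: `write_secret_key` is a hypothetical helper, and the assumption that serve.py accepts an arbitrary hex string as the file's contents should be checked against the top of serve.py.

```python
import secrets
from pathlib import Path


def write_secret_key(path="secret_key.txt", nbytes=32):
    """Write a random hex string to `path` and return it.

    Assumes the server reads the whole file as its secret key.
    """
    # token_hex(nbytes) returns 2 * nbytes hexadecimal characters.
    key = secrets.token_hex(nbytes)
    Path(path).write_text(key)
    return key
```

Run `write_secret_key()` once from the repository root, and keep the resulting file out of version control so the key stays private.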
