Commit 4ff871a

Merge pull request #4 from katjabercic/feat/matrac-project
Matrač project

- Wikidata SPARQL query adjustments (list of known exclusions)
- Added fetching from related articles and keyword extraction (with rate limiting)
- Added local LLM execution to categorize items
- TECH: added Makefile with regular commands
- TECH: added root requirements.txt that also installs dev dependencies
- TECH: fixed GitHub test action so it does not run twice when pushing to an open PR (this PR, for example)
- Added feedback mechanism for categorization
- Updated README.md files
2 parents 3356a3e + 8b1f738 commit 4ff871a

27 files changed: +1264 −44 lines

.flake8

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,3 +1,3 @@
 [flake8]
-exclude = experiments,migrations,settings.py
+exclude = experiments,migrations,settings.py,venv/
 max-line-length = 88
```

.github/workflows/test.yml

Lines changed: 9 additions & 3 deletions
```diff
@@ -1,13 +1,19 @@
 name: Test
 
-on: [push, pull_request]
+on:
+  push:
+    branches:
+      - main
+  pull_request:
 
 jobs:
   test:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v2
-      - uses: actions/setup-python@v2
+      - uses: actions/checkout@v5
+      - uses: actions/setup-python@v6
+        with:
+          python-version: '3.12.7'
       - run: pip install -r web/requirements.txt
       - run: pip install black isort flake8
       - run: python3 -m black --check .
```

.gitignore

Lines changed: 2 additions & 1 deletion
```diff
@@ -1,3 +1,4 @@
 venv
 __pycache__
-web/db.sqlite3
+web/**/*.sqlite3
+**/.env
```

.python-version

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+3.12.7
```

Makefile

Lines changed: 32 additions & 0 deletions
```diff
@@ -0,0 +1,32 @@
+prepare-web:
+	pip install -r web/requirements.txt
+	cp web/.env.example web/.env
+	python ./web/manage.py migrate
+	python ./web/manage.py createsuperuser
+
+install-dev:
+	pip install -r requirements.txt
+
+install-scispacy:
+	pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_lg-0.5.4.tar.gz
+
+start:
+	python ./web/manage.py runserver
+
+populate-db:
+	python ./web/manage.py import_wikidata
+
+clear-db:
+	python ./web/manage.py clear_wikidata
+
+compute-concepts:
+	python ./web/manage.py compute_concepts
+
+categorize:
+	python ./web/manage.py categorize --limit 10
+
+fix-files:
+	pip install -r requirements.txt
+	python3 -m black .
+	python3 -m isort .
+	python3 -m flake8 .
```

README.md

Lines changed: 86 additions & 32 deletions
````diff
@@ -8,35 +8,87 @@ For a demonstration of a page with at least one link, see for example `{baseurl}
 
 To install all the necessary Python packages, run:
 
-pip install -r requirements.txt
+```bash
+make prepare-web # Which does the necessary steps for env, db, superuser
+# OR
+pip install -r web/requirements.txt
+```
+
+Prepare an environment:
+```bash
+cp web/.env.example web/.env
+```
 
 Next, to create a database, run:
 
-python manage.py migrate
+```bash
+python manage.py migrate
+```
 
 In order to use the administrative interface, you need to create an admin user:
 
-python manage.py createsuperuser
+```bash
+python manage.py createsuperuser
+```
 
 Finally, to populate the database, run
 
-python manage.py import_wikidata
+```bash
+python manage.py import_wikidata
+# OR
+make populate-db
+```
+
+* In order to fetch Wikipedia articles and extract keywords from them:
+```bash
+make install-scispacy
+```
+  then configure your email as `WIKIPEDIA_CONTACT_EMAIL` in [source_wikidata.py](web/slurper/source_wikidata.py).
+  * This is needed so that requests to Wikipedia include a contact address.
+  * Then run the database population (make sure your db is cleared first).
+
+
 
 If you ever want to repopulate the database, you can clear it using
 
-python manage.py clear_wikidata
+```bash
+python manage.py clear_wikidata
+```
+
+### To run the categorizer
+The categorizer is set up to work with several models, divided into free and paid.
+All of them are run locally, so expect some performance hits. The models are downloaded when the categorizer is
+run initially, and by default the free models are used.
+
+The database needs to be filled before running it, so:
+```bash
+make populate-db
+```
+then
+```bash
+make categorize
+```
+
+There are some known issues with inline fixes, such as `gpt2` getting stuck
+and returning the same prompt, then a few times `---\n\n\n---`.
+
+For more details, see the [categorizer readme](web/categorizer/README.md).
 
 ## Notes for developers
 
 In order to contribute, install the [Black](https://github.com/psf/black) and [isort](https://pycqa.github.io/isort/) autoformatters and the [Flake8](https://flake8.pycqa.org/) linter.
-
-pip install black isort flake8
+```bash
+make install-dev
+```
 
 You can run all three with
-
-isort .
-black .
-flake8
+```bash
+make fix-files
+# Or manually
+isort .
+black .
+flake8
+```
 
 or set up a Git pre-commit hook by creating `.git/hooks/pre-commit` with the following contents:
 
@@ -47,35 +99,37 @@ black . && isort . && flake8
 ```
 
 Each time after you change a model, make sure to create the appropriate migrations:
-
-python manage.py makemigrations
+```bash
+python manage.py makemigrations
+```
 
 To update the database with the new model, run:
-
+```bash
 python manage.py migrate
+```
 
 ## Instructions for Katja to update the live version
-
-sudo systemctl stop mathswitch
-cd mathswitch
-git pull
-source venv/bin/activate
-cd web
-./manage.py rebuild_db
-sudo systemctl start mathswitch
-
+```bash
+sudo systemctl stop mathswitch
+cd mathswitch
+git pull
+source venv/bin/activate
+cd web
+./manage.py rebuild_db
+sudo systemctl start mathswitch
+```
 ## WD item JSON example
 
-```
+```json
 {
-    'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q192276'},
-    'art': {'type': 'uri', 'value': 'https://en.wikipedia.org/wiki/Measure_(mathematics)'},
-    'image': {'type': 'uri', 'value': 'http://commons.wikimedia.org/wiki/Special:FilePath/Measure%20illustration%20%28Vector%29.svg'},
-    'mwID': {'type': 'literal', 'value': 'Measure'},
-    'itemLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'measure'},
-    'itemDescription': {'xml:lang': 'en', 'type': 'literal', 'value': 'function assigning numbers to some subsets of a set, which could be seen as a generalization of length, area, volume and integral'},
-    'eomID': {'type': 'literal', 'value': 'measure'},
-    'pwID': {'type': 'literal', 'value': 'Definition:Measure_(Measure_Theory)'
+    "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q192276"},
+    "art": {"type": "uri", "value": "https://en.wikipedia.org/wiki/Measure_(mathematics)"},
+    "image": {"type": "uri", "value": "http://commons.wikimedia.org/wiki/Special:FilePath/Measure%20illustration%20%28Vector%29.svg"},
+    "mwID": {"type": "literal", "value": "Measure"},
+    "itemLabel": {"xml:lang": "en", "type": "literal", "value": "measure"},
+    "itemDescription": {"xml:lang": "en", "type": "literal", "value": "function assigning numbers to some subsets of a set, which could be seen as a generalization of length, area, volume and integral"},
+    "eomID": {"type": "literal", "value": "measure"},
+    "pwID": {"type": "literal", "value": "Definition:Measure_(Measure_Theory)"}
 }
 ```
````

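The WD item JSON above is a single SPARQL result binding, where every field is wrapped in a `{"type": ..., "value": ...}` object. A minimal sketch of flattening such a binding into plain values; `extract_binding` is a hypothetical helper for illustration, not code from this repository:

```python
# Sketch: flatten one Wikidata SPARQL result binding into plain values.
# `extract_binding` is a hypothetical helper, not this project's importer.

def extract_binding(binding: dict) -> dict:
    """Map {'key': {'type': ..., 'value': ...}} to {'key': value}."""
    return {key: field["value"] for key, field in binding.items()}

example = {
    "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q192276"},
    "itemLabel": {"xml:lang": "en", "type": "literal", "value": "measure"},
    "mwID": {"type": "literal", "value": "Measure"},
}

flat = extract_binding(example)
print(flat["itemLabel"])  # measure
```

Note that optional fields (e.g. `image`, `eomID`) may be absent from a binding entirely, so downstream code should use `.get()` rather than direct indexing.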
requirements.txt

Lines changed: 5 additions & 0 deletions
```diff
@@ -0,0 +1,5 @@
+black~=25.9.0
+isort~=5.12.0
+flake8~=7.3.0
+
+-r ./web/requirements.txt
```

web/.env.example

Lines changed: 2 additions & 0 deletions
```diff
@@ -0,0 +1,2 @@
+SECRET_KEY="django-insecure-9wy9w#vf^tde0262doyy_j19=64c()_qub!1)f+fh-b^=7ndw*"
+WIKIPEDIA_CONTACT_EMAIL=my@email.com
```
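The commit message mentions that article fetching is rate limited and that `WIKIPEDIA_CONTACT_EMAIL` must be configured. A rough sketch of what client-side rate limiting plus a self-identifying request header can look like; the class and function names here are illustrative assumptions, not the project's actual code in `source_wikidata.py`:

```python
import time

# Illustrative sketch of client-side rate limiting for Wikipedia fetching;
# names are assumptions, not this repository's actual implementation.

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_seconds: float = 1.0):
        self.min_interval = min_interval_seconds
        self._last_call = 0.0

    def wait(self) -> None:
        # Sleep just long enough to keep requests min_interval apart.
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()

def wikipedia_headers(contact_email: str) -> dict:
    # Wikipedia asks API clients to identify themselves, e.g. via User-Agent.
    return {"User-Agent": f"mathswitch-slurper (contact: {contact_email})"}

limiter = RateLimiter(min_interval_seconds=0.1)
limiter.wait()  # first call returns immediately
limiter.wait()  # second call sleeps ~0.1s before proceeding
print(wikipedia_headers("my@email.com")["User-Agent"])
```

Each fetch would call `limiter.wait()` before issuing the HTTP request with these headers.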

web/categorizer/README.md

Lines changed: 141 additions & 0 deletions
````diff
@@ -0,0 +1,141 @@
+# Categorizer Module
+
+The categorizer module provides LLM-powered categorization of mathematical concepts.
+
+## Setup
+
+### 1. Install Required Dependencies
+
+**For FREE local models (recommended):**
+```bash
+make install-dev
+```
+
+**For paid API models (optional):**
+
+For OpenAI:
+```bash
+pip install openai
+```
+
+For Anthropic Claude:
+```bash
+pip install anthropic
+```
+
+**For Ollama (free local alternative):**
+1. Install Ollama from https://ollama.ai
+2. Install langchain-community: `pip install langchain-community`
+3. Pull a model: `ollama pull llama2`
+
+### 2. Configure API Keys (only for paid models)
+
+Set the appropriate environment variable for your chosen LLM provider:
+
+**For OpenAI:**
+```bash
+export OPENAI_API_KEY="your-openai-api-key-here"
+```
+
+**For Anthropic Claude:**
+```bash
+export ANTHROPIC_API_KEY="your-anthropic-api-key-here"
+```
+
+**For Ollama (optional):**
+```bash
+export OLLAMA_MODEL="llama2" # Default is llama2
+```
+
+You can also add these to a `.env` file or your shell configuration file (`.bashrc`, `.zshrc`, etc.).
+
+## Usage
+
+### Basic Usage
+
+Categorize all items using the default FREE LLM (HuggingFace FLAN-T5):
+```bash
+python manage.py categorize
+```
+
+### With Options
+
+Categorize a limited number of items:
+```bash
+python manage.py categorize --limit 10
+# OR
+make categorize
+```
+
+Use a specific LLM provider:
+
+**FREE models (run locally):**
+```bash
+# Use HuggingFace FLAN-T5 (default, free, good for instruction following)
+python manage.py categorize --llm huggingface_flan_t5
+
+# Use HuggingFace GPT-2 (free, generative model)
+python manage.py categorize --llm huggingface_gpt2
+
+# Use HuggingFace DialoGPT (free, conversational model)
+python manage.py categorize --llm huggingface_dialogpt
+
+# Use Ollama (free, requires Ollama installed)
+python manage.py categorize --llm ollama
+```
+
+**Paid API models:**
+```bash
+# Use OpenAI GPT-4 (requires API key)
+python manage.py categorize --llm openai_gpt4
+
+# Use OpenAI GPT-3.5 Turbo (requires API key)
+python manage.py categorize --llm openai_gpt35
+
+# Use Anthropic Claude (requires API key)
+python manage.py categorize --llm anthropic_claude
+```
+
+Combine options:
+```bash
+python manage.py categorize --limit 5 --llm huggingface_flan_t5
+```
+
+## Architecture
+
+- `categorizer_service.py` - Main service for categorizing items
+- `llm_service.py` - Service for calling various LLM APIs
+- `management/commands/categorize.py` - Django management command
+
+## Supported LLMs
+
+### Free Models (No API Key Required)
+1. **HuggingFace FLAN-T5** - Google's instruction-following model (recommended for tasks)
+2. **HuggingFace GPT-2** - OpenAI's classic generative model
+3. **HuggingFace DialoGPT** - Microsoft's conversational model
+4. **Ollama** - Run any Ollama model locally (llama2, mistral, etc.)
+
+### Paid API Models (Require API Key)
+1. **OpenAI GPT-4** - Most capable, but expensive
+2. **OpenAI GPT-3.5 Turbo** - Fast and cheaper than GPT-4
+3. **Anthropic Claude** - High quality, good reasoning
+
+## Performance Notes
+
+- **Free models** run locally and don't require internet/API keys, but:
+  - First run downloads the model (~1-3GB depending on model)
+  - Requires sufficient RAM (4-8GB+ recommended)
+  - Slower than API models (especially without GPU)
+
+- **API models** are faster but cost money per request
+
+- **Ollama** is a good middle ground - free, local, and supports many models
+
+## Extending
+
+To add support for additional LLM providers:
+
+1. Add a new entry to the `LLMType` enum in `llm_service.py`
+2. Implement a new private method (e.g., `_call_new_provider`) in the `LLMService` class
+3. Add the new provider to the `call_llm` method's conditional logic
+4. Update the command choices in `management/commands/categorize.py`
````
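The extension steps in the new categorizer README can be sketched roughly as follows. The enum values and method signatures below are assumptions based on that README's description, not verified against the actual `llm_service.py`:

```python
from enum import Enum

# Sketch of the README's extension steps; the real LLMService may differ,
# so treat these names and signatures as illustrative assumptions.

class LLMType(Enum):
    HUGGINGFACE_FLAN_T5 = "huggingface_flan_t5"
    OLLAMA = "ollama"
    NEW_PROVIDER = "new_provider"  # step 1: add a new enum entry

class LLMService:
    def call_llm(self, llm_type: LLMType, prompt: str) -> str:
        # step 3: route the new enum value in the conditional logic
        if llm_type is LLMType.NEW_PROVIDER:
            return self._call_new_provider(prompt)
        raise NotImplementedError(f"No handler for {llm_type}")

    def _call_new_provider(self, prompt: str) -> str:
        # step 2: the provider-specific API call would go here
        return f"category for: {prompt}"

service = LLMService()
print(service.call_llm(LLMType.NEW_PROVIDER, "measure"))  # category for: measure
```

Step 4 (registering the new value in the `categorize` command's `--llm` choices) lives in `management/commands/categorize.py` and is not shown here.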

web/categorizer/__init__.py

Whitespace-only changes.
