Commit 4ff871a

Merge pull request #4 from katjabercic/feat/matrac-project
Matrač project

- Wikidata SPARQL query adjustments (list of known exclusions)
- Added fetching from related articles and keyword extraction (with rate limiting)
- Added local LLM execution to categorize items
- TECH: added Makefile with regular commands
- TECH: added root requirements.txt that also installs dev dependencies
- TECH: fixed GitHub test action so it does not run twice when pushing to an open PR (this PR, for example)
- Added feedback mechanism for categorization
- Updated README.md files
2 parents 3356a3e + 8b1f738 commit 4ff871a

27 files changed: +1264 −44 lines

.flake8

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,3 +1,3 @@
 [flake8]
-exclude = experiments,migrations,settings.py
+exclude = experiments,migrations,settings.py,venv/
 max-line-length = 88
```

.github/workflows/test.yml

Lines changed: 9 additions & 3 deletions
```diff
@@ -1,13 +1,19 @@
 name: Test
 
-on: [push, pull_request]
+on:
+  push:
+    branches:
+      - main
+  pull_request:
 
 jobs:
   test:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v2
-      - uses: actions/setup-python@v2
+      - uses: actions/checkout@v5
+      - uses: actions/setup-python@v6
+        with:
+          python-version: '3.12.7'
       - run: pip install -r web/requirements.txt
       - run: pip install black isort flake8
       - run: python3 -m black --check .
```

.gitignore

Lines changed: 2 additions & 1 deletion
```diff
@@ -1,3 +1,4 @@
 venv
 __pycache__
-web/db.sqlite3
+web/**/*.sqlite3
+**/.env
```

.python-version

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+3.12.7
```

Makefile

Lines changed: 32 additions & 0 deletions
```diff
@@ -0,0 +1,32 @@
+prepare-web:
+	pip install -r web/requirements.txt
+	cp web/.env.example web/.env
+	python ./web/manage.py migrate
+	python ./web/manage.py createsuperuser
+
+install-dev:
+	pip install -r requirements.txt
+
+install-scispacy:
+	pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_lg-0.5.4.tar.gz
+
+start:
+	python ./web/manage.py runserver
+
+populate-db:
+	python ./web/manage.py import_wikidata
+
+clear-db:
+	python ./web/manage.py clear_wikidata
+
+compute-concepts:
+	python ./web/manage.py compute_concepts
+
+categorize:
+	python ./web/manage.py categorize --limit 10
+
+fix-files:
+	pip install -r requirements.txt
+	python3 -m black .
+	python3 -m isort .
+	python3 -m flake8 .
```

README.md

Lines changed: 86 additions & 32 deletions
````diff
@@ -8,35 +8,87 @@ For a demonstration of a page with at least one link, see for example `{baseurl}
 
 To install all the necessary Python packages, run:
 
-pip install -r requirements.txt
+```bash
+make prepare-web # Which does the necessary steps for env, db, superuser
+# OR
+pip install -r web/requirements.txt
+```
+
+Prepare an environment:
+```bash
+cp web/.env.example web/.env
+```
 
 Next, to create a database, run:
 
-python manage.py migrate
+```bash
+python manage.py migrate
+```
 
 In order to use the administrative interface, you need to create an admin user:
 
-python manage.py createsuperuser
+```bash
+python manage.py createsuperuser
+```
 
 Finally, to populate the database, run
 
-python manage.py import_wikidata
+```bash
+python manage.py import_wikidata
+# OR
+make populate-db
+```
+
+* In order to fetch Wikipedia articles and extract keywords from them:
+```bash
+make install-scispacy
+```
+  then configure your email as `WIKIPEDIA_CONTACT_EMAIL` in [source_wikidata.py](web/slurper/source_wikidata.py).
+  * This is needed so that requests to Wikipedia include a contact address.
+  * Then run the database population (make sure your db is cleared first).
+
+
 
 If you ever want to repopulate the database, you can clear it using
 
-python manage.py clear_wikidata
+```bash
+python manage.py clear_wikidata
+```
+
+### To run the categorizer
+The categorizer is set up to work with several models, divided into free and paid.
+All of them are run locally, so expect some performance hits. The models are downloaded when the categorizer is
+run initially, and by default the free models are used.
+
+The database needs to be filled before running it, so:
+```bash
+make populate-db
+```
+then
+```bash
+make categorize
+```
+
+There are some known issues with inline fixes, such as `gpt2` getting stuck
+and returning the same prompt, then a few times `---\n\n\n---`.
+
+For more details, see the [categorizer readme](web/categorizer/README.md).
 
 ## Notes for developers
 
 In order to contribute, install the [Black](https://github.com/psf/black) and [isort](https://pycqa.github.io/isort/) autoformatters and the [Flake8](https://flake8.pycqa.org/) linter.
-
-pip install black isort flake8
+```bash
+make install-dev
+```
 
 You can run all three with
-
-isort .
-black .
-flake8
+```bash
+make fix-files
+# Or manually
+isort .
+black .
+flake8
+```
 
 or set up a Git pre-commit hook by creating `.git/hooks/pre-commit` with the following contents:
 
@@ -47,35 +99,37 @@ black . && isort . && flake8
 ```
 
 Each time after you change a model, make sure to create the appropriate migrations:
-
-python manage.py makemigrations
+```bash
+python manage.py makemigrations
+```
 
 To update the database with the new model, run:
-
+```bash
 python manage.py migrate
+```
 
 ## Instructions for Katja to update the live version
-
-sudo systemctl stop mathswitch
-cd mathswitch
-git pull
-source venv/bin/activate
-cd web
-./manage.py rebuild_db
-sudo systemctl start mathswitch
-
+```bash
+sudo systemctl stop mathswitch
+cd mathswitch
+git pull
+source venv/bin/activate
+cd web
+./manage.py rebuild_db
+sudo systemctl start mathswitch
+```
 ## WD item JSON example
 
-```
+```json
 {
-    'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q192276'},
-    'art': {'type': 'uri', 'value': 'https://en.wikipedia.org/wiki/Measure_(mathematics)'},
-    'image': {'type': 'uri', 'value': 'http://commons.wikimedia.org/wiki/Special:FilePath/Measure%20illustration%20%28Vector%29.svg'},
-    'mwID': {'type': 'literal', 'value': 'Measure'},
-    'itemLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'measure'},
-    'itemDescription': {'xml:lang': 'en', 'type': 'literal', 'value': 'function assigning numbers to some subsets of a set, which could be seen as a generalization of length, area, volume and integral'},
-    'eomID': {'type': 'literal', 'value': 'measure'},
-    'pwID': {'type': 'literal', 'value': 'Definition:Measure_(Measure_Theory)'
+    "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q192276"},
+    "art": {"type": "uri", "value": "https://en.wikipedia.org/wiki/Measure_(mathematics)"},
+    "image": {"type": "uri", "value": "http://commons.wikimedia.org/wiki/Special:FilePath/Measure%20illustration%20%28Vector%29.svg"},
+    "mwID": {"type": "literal", "value": "Measure"},
+    "itemLabel": {"xml:lang": "en", "type": "literal", "value": "measure"},
+    "itemDescription": {"xml:lang": "en", "type": "literal", "value": "function assigning numbers to some subsets of a set, which could be seen as a generalization of length, area, volume and integral"},
+    "eomID": {"type": "literal", "value": "measure"},
+    "pwID": {"type": "literal", "value": "Definition:Measure_(Measure_Theory)"}
 }
 ```
````

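The WD item JSON above is a single SPARQL result binding, where every field is wrapped in a `{"type": ..., "value": ...}` object. A minimal sketch of flattening such a binding into plain values; `extract_binding` is a hypothetical helper for illustration, not code from this repository:

```python
# Sketch: flatten one Wikidata SPARQL result binding into plain values.
# `extract_binding` is a hypothetical helper, not this project's importer.

def extract_binding(binding: dict) -> dict:
    """Map {'key': {'type': ..., 'value': ...}} to {'key': value}."""
    return {key: field["value"] for key, field in binding.items()}

example = {
    "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q192276"},
    "itemLabel": {"xml:lang": "en", "type": "literal", "value": "measure"},
    "mwID": {"type": "literal", "value": "Measure"},
}

flat = extract_binding(example)
print(flat["itemLabel"])  # measure
```

Note that optional fields (e.g. `image`, `eomID`) may be absent from a binding entirely, so downstream code should use `.get()` rather than direct indexing.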
requirements.txt

Lines changed: 5 additions & 0 deletions
```diff
@@ -0,0 +1,5 @@
+black~=25.9.0
+isort~=5.12.0
+flake8~=7.3.0
+
+-r ./web/requirements.txt
```

web/.env.example

Lines changed: 2 additions & 0 deletions
```diff
@@ -0,0 +1,2 @@
+SECRET_KEY="django-insecure-9wy9w#vf^tde0262doyy_j19=64c()_qub!1)f+fh-b^=7ndw*"
+WIKIPEDIA_CONTACT_EMAIL=my@email.com
```
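The commit message mentions that article fetching is rate limited and that `WIKIPEDIA_CONTACT_EMAIL` must be configured. A rough sketch of what client-side rate limiting plus a self-identifying request header can look like; the class and function names here are illustrative assumptions, not the project's actual code in `source_wikidata.py`:

```python
import time

# Illustrative sketch of client-side rate limiting for Wikipedia fetching;
# names are assumptions, not this repository's actual implementation.

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_seconds: float = 1.0):
        self.min_interval = min_interval_seconds
        self._last_call = 0.0

    def wait(self) -> None:
        # Sleep just long enough to keep requests min_interval apart.
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()

def wikipedia_headers(contact_email: str) -> dict:
    # Wikipedia asks API clients to identify themselves, e.g. via User-Agent.
    return {"User-Agent": f"mathswitch-slurper (contact: {contact_email})"}

limiter = RateLimiter(min_interval_seconds=0.1)
limiter.wait()  # first call returns immediately
limiter.wait()  # second call sleeps ~0.1s before proceeding
print(wikipedia_headers("my@email.com")["User-Agent"])
```

Each fetch would call `limiter.wait()` before issuing the HTTP request with these headers.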

web/categorizer/README.md

Lines changed: 141 additions & 0 deletions
````diff
@@ -0,0 +1,141 @@
+# Categorizer Module
+
+The categorizer module provides LLM-powered categorization of mathematical concepts.
+
+## Setup
+
+### 1. Install Required Dependencies
+
+**For FREE local models (recommended):**
+```bash
+make install-dev
+```
+
+**For paid API models (optional):**
+
+For OpenAI:
+```bash
+pip install openai
+```
+
+For Anthropic Claude:
+```bash
+pip install anthropic
+```
+
+**For Ollama (free local alternative):**
+1. Install Ollama from https://ollama.ai
+2. Install langchain-community: `pip install langchain-community`
+3. Pull a model: `ollama pull llama2`
+
+### 2. Configure API Keys (only for paid models)
+
+Set the appropriate environment variable for your chosen LLM provider:
+
+**For OpenAI:**
+```bash
+export OPENAI_API_KEY="your-openai-api-key-here"
+```
+
+**For Anthropic Claude:**
+```bash
+export ANTHROPIC_API_KEY="your-anthropic-api-key-here"
+```
+
+**For Ollama (optional):**
+```bash
+export OLLAMA_MODEL="llama2" # Default is llama2
+```
+
+You can also add these to a `.env` file or your shell configuration file (`.bashrc`, `.zshrc`, etc.).
+
+## Usage
+
+### Basic Usage
+
+Categorize all items using the default FREE LLM (HuggingFace FLAN-T5):
+```bash
+python manage.py categorize
+```
+
+### With Options
+
+Categorize a limited number of items:
+```bash
+python manage.py categorize --limit 10
+# OR
+make categorize
+```
+
+Use a specific LLM provider:
+
+**FREE models (run locally):**
+```bash
+# Use HuggingFace FLAN-T5 (default, free, good for instruction following)
+python manage.py categorize --llm huggingface_flan_t5
+
+# Use HuggingFace GPT-2 (free, generative model)
+python manage.py categorize --llm huggingface_gpt2
+
+# Use HuggingFace DialoGPT (free, conversational model)
+python manage.py categorize --llm huggingface_dialogpt
+
+# Use Ollama (free, requires Ollama installed)
+python manage.py categorize --llm ollama
+```
+
+**Paid API models:**
+```bash
+# Use OpenAI GPT-4 (requires API key)
+python manage.py categorize --llm openai_gpt4
+
+# Use OpenAI GPT-3.5 Turbo (requires API key)
+python manage.py categorize --llm openai_gpt35
+
+# Use Anthropic Claude (requires API key)
+python manage.py categorize --llm anthropic_claude
+```
+
+Combine options:
+```bash
+python manage.py categorize --limit 5 --llm huggingface_flan_t5
+```
+
+## Architecture
+
+- `categorizer_service.py` - Main service for categorizing items
+- `llm_service.py` - Service for calling various LLM APIs
+- `management/commands/categorize.py` - Django management command
+
+## Supported LLMs
+
+### Free Models (No API Key Required)
+1. **HuggingFace FLAN-T5** - Google's instruction-following model (recommended for tasks)
+2. **HuggingFace GPT-2** - OpenAI's classic generative model
+3. **HuggingFace DialoGPT** - Microsoft's conversational model
+4. **Ollama** - Run any Ollama model locally (llama2, mistral, etc.)
+
+### Paid API Models (Require API Key)
+1. **OpenAI GPT-4** - Most capable, but expensive
+2. **OpenAI GPT-3.5 Turbo** - Fast and cheaper than GPT-4
+3. **Anthropic Claude** - High quality, good reasoning
+
+## Performance Notes
+
+- **Free models** run locally and don't require internet/API keys, but:
+  - First run downloads the model (~1-3GB depending on model)
+  - Requires sufficient RAM (4-8GB+ recommended)
+  - Slower than API models (especially without GPU)
+
+- **API models** are faster but cost money per request
+
+- **Ollama** is a good middle ground - free, local, and supports many models
+
+## Extending
+
+To add support for additional LLM providers:
+
+1. Add a new entry to the `LLMType` enum in `llm_service.py`
+2. Implement a new private method (e.g., `_call_new_provider`) in the `LLMService` class
+3. Add the new provider to the `call_llm` method's conditional logic
+4. Update the command choices in `management/commands/categorize.py`
````
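The extension steps in the new categorizer README can be sketched roughly as follows. The enum values and method signatures below are assumptions based on that README's description, not verified against the actual `llm_service.py`:

```python
from enum import Enum

# Sketch of the README's extension steps; the real LLMService may differ,
# so treat these names and signatures as illustrative assumptions.

class LLMType(Enum):
    HUGGINGFACE_FLAN_T5 = "huggingface_flan_t5"
    OLLAMA = "ollama"
    NEW_PROVIDER = "new_provider"  # step 1: add a new enum entry

class LLMService:
    def call_llm(self, llm_type: LLMType, prompt: str) -> str:
        # step 3: route the new enum value in the conditional logic
        if llm_type is LLMType.NEW_PROVIDER:
            return self._call_new_provider(prompt)
        raise NotImplementedError(f"No handler for {llm_type}")

    def _call_new_provider(self, prompt: str) -> str:
        # step 2: the provider-specific API call would go here
        return f"category for: {prompt}"

service = LLMService()
print(service.call_llm(LLMType.NEW_PROVIDER, "measure"))  # category for: measure
```

Step 4 (registering the new value in the `categorize` command's `--llm` choices) lives in `management/commands/categorize.py` and is not shown here.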

web/categorizer/__init__.py

Whitespace-only changes.
