| 1 | +# RUW Database Overview |
| 2 | + |
| 3 | +Live dashboard: **https://hcss-utils.github.io/ruw-warehouse-overview/** |
| 4 | + |
| 5 | +A lightweight status page for the **Russian-Ukrainian War (RUW) Database**. The page shows per-source document and chunk counts, annotation coverage, and freshness at a glance. |
| 6 | + |
| 7 | +## What it shows |
| 8 | + |
| 9 | +Each row in the table corresponds to one source database (Telegram channels, Kremlin, ISW, Integrum, etc.). For every source the page displays: |
| 10 | + |
| 11 | +| Column | Description | |
| 12 | +|---|---| |
| 13 | +| Lang | Languages present in the source (up to 5) | |
| 14 | +| Documents | Total number of ingested documents | |
| 15 | +| Chunks | Total text chunks produced from those documents | |
| 16 | +| Relevant chunks | Chunks annotated as relevant, with coverage % | |
| 17 | +| Last Updated | Date of the most recent document in the source | |
| 18 | + |
| 19 | +Summary cards at the top aggregate totals across all sources and show the latest data date across the entire corpus. |
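
The aggregation behind the summary cards can be sketched in Python as follows. This is a minimal illustration, not the code in `main.py`: the `SourceStats` record, its field names, and the `summarize` helper are hypothetical, and the sketch assumes every source row carries integer counts and a date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SourceStats:
    """One table row; field names are hypothetical, not the real schema."""

    name: str
    documents: int
    chunks: int
    relevant_chunks: int
    last_updated: date


def summarize(rows: list[SourceStats]) -> dict:
    """Aggregate per-source rows into corpus-wide totals for the summary cards."""
    return {
        "documents": sum(r.documents for r in rows),
        "chunks": sum(r.chunks for r in rows),
        "relevant_chunks": sum(r.relevant_chunks for r in rows),
        # Latest data date across the entire corpus; None if there are no rows.
        "latest": max((r.last_updated for r in rows), default=None),
    }
```

Using `max(..., default=None)` keeps the page renderable even when the query returns no sources.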
| 20 | + |
| 21 | +## How it works |
| 22 | + |
| 23 | +### SQL logic |
| 24 | + |
| 25 | +`assets/stats.sql` contains a single query that joins `public.uploaded_document` through `document_section` to `document_section_chunk`. Relevant chunks are determined by combining two annotation tables: |
| 26 | + |
| 27 | +- `taxonomy` — legacy classifications |
| 28 | +- `taxonomy_annotation` — current pipeline annotations (`is_relevant = true` and `HLTP IS NOT NULL`) |
| 29 | + |
| 30 | +The two are merged with `UNION`, which deduplicates chunk IDs, so a chunk annotated in both tables is counted once. Language codes are uppercased, deduplicated, and concatenated per source with `STRING_AGG(DISTINCT UPPER(...))`. |
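
The merge and the language aggregation described above can be mirrored in plain Python to make the counting rule concrete. This is a sketch under stated assumptions, not a translation of `assets/stats.sql`: the function names, the `(chunk_id, is_relevant, hltp)` tuple layout, and the `", "` separator are all hypothetical.

```python
def relevant_chunk_ids(legacy_ids, annotations):
    """Mirror the UNION: chunk IDs from either table, each counted once.

    legacy_ids: chunk IDs found in the legacy `taxonomy` table.
    annotations: (chunk_id, is_relevant, hltp) tuples standing in for
    `taxonomy_annotation` rows; this tuple layout is an assumption.
    """
    current = {
        chunk_id
        for chunk_id, is_relevant, hltp in annotations
        # Same filter as `is_relevant = true` and `HLTP IS NOT NULL`.
        if is_relevant is True and hltp is not None
    }
    # Set union removes duplicates, as UNION (not UNION ALL) does in SQL.
    return set(legacy_ids) | current


def aggregate_languages(codes, sep=", "):
    """Rough equivalent of STRING_AGG(DISTINCT UPPER(code), sep)."""
    # None values are skipped, as SQL aggregates ignore NULL inputs.
    # Sorting is only for a stable result; the separator is an assumption.
    return sep.join(sorted({c.upper() for c in codes if c is not None}))
```

The relevant-chunk count for a source is then `len(relevant_chunk_ids(...))`, so a chunk present in both tables contributes one to the total.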
| 31 | + |
| 32 | +### Technical pipeline |
| 33 | + |
| 34 | +``` |
| 35 | +schedule (daily 06:00 UTC) or push to main |
| 36 | + │ |
| 37 | + ▼ |
| 38 | +main.py |
| 39 | + ├── reads assets/stats.sql |
| 40 | + ├── executes against DATABASE (secret) |
| 41 | + ├── formats display values (number formatting, language ordering, dates) |
| 42 | + ├── writes data/stats.json (raw snapshot for history) |
| 43 | + └── renders templates/index.j2 → index.html |
| 44 | + │ |
| 45 | + ▼ |
| 46 | +GitHub Actions |
| 47 | + ├── commits data/stats.json [skip ci] |
| 48 | + ├── copies index.html + assets/ → _site/ |
| 49 | + └── deploys _site/ to GitHub Pages |
| 50 | +``` |
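
The "formats display values" step in the diagram might look roughly like the sketch below. All helper names are hypothetical, and the details (one decimal place for coverage, alphabetical language order, ISO dates, `"n/a"` for missing dates) are assumptions rather than the behavior of `main.py`; only the limit of five languages comes from the column description.

```python
from datetime import date
from typing import Optional


def format_count(n: int) -> str:
    """Thousands-separated integer for document and chunk counts."""
    return f"{n:,}"


def format_coverage(relevant: int, total: int) -> str:
    """Relevant chunks as a share of all chunks; avoids dividing by zero."""
    if total == 0:
        return "0.0%"
    return f"{relevant / total:.1%}"


def order_languages(codes: list[str], limit: int = 5) -> list[str]:
    """Deduplicate, sort alphabetically, and keep at most `limit` codes."""
    return sorted(set(codes))[:limit]


def format_date(d: Optional[date]) -> str:
    """ISO date string, or a placeholder when the source has no documents."""
    return d.isoformat() if d else "n/a"
```

Keeping formatting in Python rather than in the template means `data/stats.json` can store raw numbers while `index.html` shows the display strings.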
| 51 | + |
| 52 | +## Development |
| 53 | + |
| 54 | +```bash |
| 55 | +git clone https://github.com/hcss-utils/ruw-warehouse-overview.git |
| 56 | +cd ruw-warehouse-overview |
| 57 | +``` |
| 58 | + |
| 59 | +Work on a feature branch: |
| 60 | + |
| 61 | +```bash |
| 62 | +git checkout -b feature/<name> |
| 63 | +``` |
| 64 | + |
| 65 | +After making changes, run quality checks before committing: |
| 66 | + |
| 67 | +```bash |
| 68 | +bash quality.sh |
| 69 | +``` |
| 70 | + |
| 71 | +`quality.sh` runs ruff (linting + unused imports), isort, black, and ty (type checking) against Python 3.12. |
| 72 | + |
| 73 | +Set `DATABASE` to a valid PostgreSQL connection string to run `main.py` locally: |
| 74 | + |
| 75 | +```bash |
| 76 | +DATABASE=postgresql://... uv run python main.py |
| 77 | +``` |
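
For local runs it can help to fail early when `DATABASE` is missing or malformed. The helper below is a hypothetical sketch, not part of `main.py`; it assumes the connection string is given in URI form (`postgresql://` or `postgres://`) and would reject the keyword/value form that PostgreSQL also accepts.

```python
import os
from urllib.parse import urlsplit


def database_url(env=os.environ) -> str:
    """Return the connection string from DATABASE or fail with a clear message."""
    url = env.get("DATABASE", "").strip()
    if not url:
        raise RuntimeError("DATABASE is not set")
    # Assumption: only URI-style connection strings are supported here.
    if urlsplit(url).scheme not in ("postgresql", "postgres"):
        raise RuntimeError("DATABASE must be a postgresql:// connection string")
    return url
```

Passing the environment as a parameter makes the check easy to exercise with a plain dictionary instead of real environment variables.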
| 78 | + |
| 79 | +The `DATABASE` secret used by GitHub Actions was set from a local `.env` file via the GitHub CLI: |
| 80 | + |
| 81 | +```bash |
| 82 | +gh secret set DATABASE --env-file .env |
| 83 | +``` |