Commit db69ce3

add README
1 parent 6c5d41a commit db69ce3

1 file changed

Lines changed: 83 additions & 0 deletions

README.md

@@ -0,0 +1,83 @@
# RUW Database Overview

Live dashboard: **https://hcss-utils.github.io/ruw-warehouse-overview/**

A lightweight status page for the **Russian-Ukrainian War (RUW) Database**. The page shows per-source document and chunk counts, annotation coverage, and freshness at a glance.

## What it shows

Each row in the table corresponds to one source database (Telegram channels, Kremlin, ISW, Integrum, etc.). For every source the page displays:

| Column | Description |
|---|---|
| Lang | Languages present in the source (up to 5) |
| Documents | Total number of ingested documents |
| Chunks | Total text chunks produced from those documents |
| Relevant chunks | Chunks annotated as relevant, with coverage % |
| Last Updated | Date of the most recent document in the source |

Summary cards at the top aggregate totals across all sources and show the latest data date across the entire corpus.
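The aggregation behind the summary cards can be sketched roughly as follows. This is an illustration only, not the project's code: the row keys (`documents`, `chunks`, `relevant_chunks`, `last_updated`) and the `summarize` helper are hypothetical names that mirror the table columns, and coverage is assumed to mean relevant chunks divided by total chunks.

```python
from datetime import date


def summarize(rows):
    """Aggregate per-source rows into corpus-wide totals (hypothetical helper)."""
    totals = {
        "documents": sum(r["documents"] for r in rows),
        "chunks": sum(r["chunks"] for r in rows),
        "relevant_chunks": sum(r["relevant_chunks"] for r in rows),
    }
    # Assumption: coverage % = relevant chunks / all chunks; guard against zero chunks.
    totals["coverage_pct"] = (
        round(100 * totals["relevant_chunks"] / totals["chunks"], 1)
        if totals["chunks"]
        else 0.0
    )
    # The corpus-wide "latest data date" is the newest per-source date.
    dates = [r["last_updated"] for r in rows if r["last_updated"] is not None]
    totals["latest"] = max(dates) if dates else None
    return totals


# Made-up sample rows, for illustration only.
rows = [
    {"documents": 10, "chunks": 100, "relevant_chunks": 25, "last_updated": date(2024, 1, 5)},
    {"documents": 5, "chunks": 60, "relevant_chunks": 15, "last_updated": date(2024, 2, 1)},
]
summary = summarize(rows)
```

Sources with no documents yet would carry `last_updated = None` in this sketch, which is why they are skipped when picking the latest date.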
20+
21+
## How it works
22+
23+
### SQL logic
24+
25+
`assets/stats.sql` runs a single query against `public.uploaded_document` joined through `document_section``document_section_chunk`. Relevant chunks are determined by combining two annotation tables:
26+
27+
- `taxonomy` — legacy classifications
28+
- `taxonomy_annotation` — current pipeline annotations (`is_relevant = true` and `HLTP IS NOT NULL`)
29+
30+
The two are merged with `UNION` (deduplicating chunk IDs) before counting. Language codes are aggregated with `STRING_AGG(DISTINCT UPPER(...))` per source.
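The deduplicating effect of `UNION` can be checked with a small, self-contained sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the schemas are hypothetical simplifications: the `chunk_id` column name and the sample rows are assumptions, not taken from `assets/stats.sql`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified schemas (the real tables have more columns).
    CREATE TABLE taxonomy (chunk_id INTEGER);
    CREATE TABLE taxonomy_annotation (chunk_id INTEGER, is_relevant INTEGER, HLTP TEXT);

    INSERT INTO taxonomy VALUES (1), (2);
    INSERT INTO taxonomy_annotation VALUES
        (2, 1, 'x'),   -- also present in taxonomy: must be counted once
        (3, 1, 'y'),   -- relevant, only in the current pipeline
        (4, 0, 'z'),   -- not relevant: excluded
        (5, 1, NULL);  -- HLTP missing: excluded
    """
)

# SQLite stores booleans as 0/1; in PostgreSQL the filter would read is_relevant = true.
relevant = conn.execute(
    """
    SELECT COUNT(*) FROM (
        SELECT chunk_id FROM taxonomy
        UNION  -- UNION (unlike UNION ALL) removes duplicate chunk IDs
        SELECT chunk_id FROM taxonomy_annotation
        WHERE is_relevant = 1 AND HLTP IS NOT NULL
    )
    """
).fetchone()[0]
```

With `UNION ALL` instead, chunk 2 would be counted twice and the coverage figure would be inflated for any chunk annotated in both tables.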
### Technical pipeline

```
schedule (daily 06:00 UTC) or push to main


main.py
├── reads assets/stats.sql
├── executes against DATABASE (secret)
├── formats display values (number formatting, language ordering, dates)
├── writes data/stats.json (raw snapshot for history)
└── renders templates/index.j2 → index.html


GitHub Actions
├── commits data/stats.json [skip ci]
├── copies index.html + assets/ → _site/
└── deploys _site/ to GitHub Pages
```
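The "formats display values" step might look something like the sketch below. The actual rules in `main.py` are not described in this README, so the helper names and the choices made here (thousands separators, alphabetical language order, ISO dates, a cap of five languages) are assumptions for illustration.

```python
from datetime import date


def format_count(n):
    """Render an integer with thousands separators."""
    return f"{n:,}"


def format_languages(codes, limit=5):
    """Upper-case, deduplicate, sort alphabetically, and keep at most `limit` codes."""
    unique = sorted({c.strip().upper() for c in codes if c and c.strip()})
    return ", ".join(unique[:limit])


def format_date(d):
    """Render a date as YYYY-MM-DD, or a placeholder when no date is known."""
    return d.isoformat() if d else "-"
```

Keeping the formatting in small pure functions like these would let `data/stats.json` hold the raw values while only the rendered page sees the display strings.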
## Development

```bash
git clone https://github.com/hcss-utils/ruw-warehouse-overview.git
cd ruw-warehouse-overview
```

Work on a feature branch:

```bash
git checkout -b feature/<name>
```

After making changes, run quality checks before committing:

```bash
bash quality.sh
```

`quality.sh` runs ruff (linting + unused imports), isort, black, and ty (type checking) against Python 3.12.

Set `DATABASE` to a valid PostgreSQL connection string to run `main.py` locally:

```bash
DATABASE=postgresql://... uv run python main.py
```
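A script that depends on `DATABASE` can fail early with a clear message when the variable is missing. The helper below is a hypothetical sketch of that check, not code from `main.py`; the `env` parameter exists only so the function can be exercised without touching the real environment.

```python
import os


def get_database_url(env=None):
    """Return the connection string from DATABASE, or raise a clear error."""
    env = os.environ if env is None else env
    url = env.get("DATABASE", "").strip()
    if not url:
        raise RuntimeError(
            "DATABASE is not set; export a PostgreSQL connection string first"
        )
    return url
```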
The `DATABASE` secret used by GitHub Actions was set from a local `.env` file via the GitHub CLI:

```bash
gh secret set DATABASE --env-file .env
```
