Government RAG Example: Incorrect location extraction in basemodel.py breaks province filtering #379

### Background

When running the `government_rag` benchmark example via the standard command:

```bash
ianvs -f ./examples/government_rag/singletask_learning_bench/benchmarkingjob.yaml
```

the test algorithm needs the target location (province) string in order to filter the RAG dataset.

However, the extraction logic relies entirely on the basename of the current working directory. That value is the name of the directory `ianvs` was launched from (normally the project root), not the target province.

---

### The Bug

In `examples/government_rag/singletask_learning_bench/testalgorithms/basemodel.py` around line 211, the script assigns:

```python
current_dir = os.path.basename(os.getcwd())
```

Because `ianvs` is usually executed from the root of the project, `os.getcwd()` returns the path to the root folder. As a result, `current_dir` becomes `"ianvs"` (or whatever the root workspace is named) rather than an actual province string (like `"Beijing"` or `"Shanghai"`).
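The behaviour is easy to confirm in isolation. The sketch below is illustrative only: it creates a throwaway directory named `ianvs` (an assumption standing in for the project root) and evaluates the same expression from inside it, without touching the real repository.

```python
import os
import tempfile


def basename_of_cwd():
    """Mirror the expression used in basemodel.py."""
    return os.path.basename(os.getcwd())


def demo():
    """Evaluate the expression from inside a fake project root."""
    original_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical project root; the name "ianvs" is an assumption.
        root = os.path.join(tmp, "ianvs")
        os.makedirs(root)
        os.chdir(root)
        try:
            return basename_of_cwd()
        finally:
            # Restore the working directory before the temp dir is removed.
            os.chdir(original_cwd)


print(demo())  # prints "ianvs", which is a directory name, not a province
```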

---

### Impact

This wrong location string (`loc`) is then passed as an argument to all subsequent tasks in the threading pipeline. Consequently, `[local]` and `[other]` RAG queries use the literal string `"ianvs"` as the province name filter, which:

- Completely breaks province-level target filtering
- Produces no exception or warning — the failure is **silent**
- Compromises the integrity of all benchmark results
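Because nothing fails loudly, one possible safeguard is to validate the location before it reaches the RAG query. This is a minimal sketch, not code from `basemodel.py`: `validate_location` and the `known` set are hypothetical names, and the province values are examples only.

```python
def validate_location(loc, known_locations):
    """Return loc if it is a known province, otherwise raise ValueError."""
    if loc not in known_locations:
        raise ValueError(
            f"Unexpected location {loc!r}; expected one of {sorted(known_locations)}"
        )
    return loc


known = {"Beijing", "Shanghai"}  # illustrative values only

validate_location("Beijing", known)  # accepted, returned unchanged

try:
    validate_location("ianvs", known)  # the value produced by the bug
except ValueError as exc:
    print(exc)
```

With a check like this, a working-directory name leaking into the pipeline would stop the run instead of silently skewing the results.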

---

### Steps to Reproduce

1. Run the benchmarking job from the project root:
   ```bash
   ianvs -f ./examples/government_rag/singletask_learning_bench/benchmarkingjob.yaml
   ```
2. At `basemodel.py:211`, `os.getcwd()` resolves to the project root directory, so `current_dir` becomes `"ianvs"`.
3. The invalid location string is forwarded to `self.rag.query()`, causing silent filtering failures across all province-level queries.

---

### Proposed Solution

Replace the `os.getcwd()`-based approach with a method that derives the target province from the dataset directly. The dataset's JSONL file contains a `level_4_dim` field that holds the correct province name and should be used as the source of truth:

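```python
import json  # module-level imports required by the helper
import os


def _load_locations_from_dataset(self, data):
    """Build a query -> province mapping from the dataset's JSONL file."""
    query_to_location = {}
    all_locations = set()

    # `data_file` may be absent; in that case both results stay empty.
    dataset_path = getattr(data, 'data_file', None)

    if dataset_path and os.path.isfile(dataset_path):
        with open(dataset_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                entry = json.loads(line)
                query = entry.get('query', '')
                # `level_4_dim` holds the province name for this query.
                location = entry.get('level_4_dim', 'Unknown')
                query_to_location[query] = location
                all_locations.add(location)

    return query_to_location, list(all_locations)
```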

Then in `predict()`, replace the `os.getcwd()` call with:

```python
query_to_location, all_locations = self._load_locations_from_dataset(data)
self.all_locations = all_locations

for i in range(len(data.x)):
    query = data.x[i]
    location = query_to_location.get(query, "Unknown")
```

This ensures the province is always derived from the dataset metadata, regardless of where `ianvs` is invoked from.
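To illustrate the lookup, here is a standalone sketch of the same parsing logic run against a temporary JSONL file. The function name `load_locations` and the sample records are made up for the example; only the `query` and `level_4_dim` field names come from the proposal above.

```python
import json
import os
import tempfile


def load_locations(dataset_path):
    """Standalone variant of the helper above: map each query to its province."""
    query_to_location = {}
    if dataset_path and os.path.isfile(dataset_path):
        with open(dataset_path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                entry = json.loads(line)
                query = entry.get("query", "")
                query_to_location[query] = entry.get("level_4_dim", "Unknown")
    return query_to_location


# Made-up records; the third one has no province field.
rows = [
    {"query": "q1", "level_4_dim": "Beijing"},
    {"query": "q2", "level_4_dim": "Shanghai"},
    {"query": "q3"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    mapping = load_locations(path)

print(mapping)  # {'q1': 'Beijing', 'q2': 'Shanghai', 'q3': 'Unknown'}
```

Note that a record without `level_4_dim` falls back to `"Unknown"`, so callers may want to treat that value as a signal of incomplete metadata rather than as a real province.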

---

### Additional Issues Found

While investigating this bug, the following additional issues were identified and fixed in this PR:

| Issue | Fix |
|---|---|
| Hardcoded API keys in `get_model_response_*` methods | Replaced with environment variables |
| Hardcoded model path `/home/icyfeather/models/bge-m3` | Made configurable via `__init__` kwargs or `Context` |
| Hardcoded `persist_directory="./chroma_db"` | Made configurable via `__init__` kwargs |
| `self.rag` mutated across threads, causing a race condition | Changed to local variable `rag` inside `process_query` |
| Unused imports (`tempfile`, `time`, `zipfile`, `numpy`, etc.) | Removed |
| `json` imported twice | Removed duplicate import |
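The race-condition fix in the table can be pictured with a small sketch. It assumes a thread pool drives `process_query`; `FakeRag` is a hypothetical stand-in for the real RAG object, and the point is only that each call holds its own local `rag` instead of reassigning a shared `self.rag`.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeRag:
    """Hypothetical stand-in for the real RAG object."""

    def __init__(self, location):
        self.location = location

    def query(self, text):
        return f"[{self.location}] {text}"


def process_query(item):
    query, location = item
    # Local variable: each call owns its own instance, so no other
    # thread can replace it between construction and use.
    rag = FakeRag(location)
    return rag.query(query)


items = [("q1", "Beijing"), ("q2", "Shanghai"), ("q3", "Beijing")]
with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map returns results in input order.
    results = list(pool.map(process_query, items))

print(results)  # ['[Beijing] q1', '[Shanghai] q2', '[Beijing] q3']
```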

---

### Files Changed

- `examples/government_rag/singletask_learning_bench/testalgorithms/basemodel.py`
- `.env.example` _(added — documents required environment variables)_
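As a rough sketch of how the environment-based configuration might be read, the snippet below loads settings from a mapping that defaults to `os.environ`. The variable names `LLM_API_KEY`, `EMBEDDING_MODEL_PATH`, and `CHROMA_PERSIST_DIR`, the `load_settings` helper, and the `bge-m3` fallback are all assumptions for illustration; the actual names are whatever `.env.example` documents.

```python
import os


def load_settings(env=None):
    """Read configuration from the environment (or a supplied mapping)."""
    env = os.environ if env is None else env

    # Hypothetical variable name; refuse to run without a key.
    api_key = env.get("LLM_API_KEY")
    if not api_key:
        raise RuntimeError("LLM_API_KEY is not set; see .env.example")

    return {
        "api_key": api_key,
        # Assumed fallback values, overridable via the environment.
        "model_path": env.get("EMBEDDING_MODEL_PATH", "bge-m3"),
        "persist_directory": env.get("CHROMA_PERSIST_DIR", "./chroma_db"),
    }


settings = load_settings({"LLM_API_KEY": "dummy-key"})
print(settings["persist_directory"])  # ./chroma_db
```

Passing the mapping explicitly keeps the helper easy to test without modifying the real process environment.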
