Merge pull request #11 from Sahandfer/feature/consistentMI_reranker

Sahandfer · web-flow · commit bb2925c3cc56 · 2026-04-29T10:48:28.000+08:00
Switched the reranker for ConsistentMI to LiteLLM from HF
diff --git a/docs/docs/components/clients/consistentmi.md b/docs/docs/components/clients/consistentmi.md
@@ -24,7 +24,7 @@ ConsistentMI simulates clients in motivational interviewing (MI) sessions with c
 
 1. **Load Profile**: Reads the character JSON (personas, beliefs, acceptable plans, motivation topics) and initializes `stage` and `receptivity`.
 2. **Initialize Prompts**: Builds a system prompt that anchors the client’s behavior/goal and injects personas + beliefs for consistency.
-3. **Track Topic Engagement**: Matches the therapist’s latest utterance to a motivation topic, then uses the topic graph distance to update `engagement` and count repeated off-topic turns.
+3. **Track Topic Engagement**: Matches the therapist’s latest utterance to a motivation topic using a reranker-backed topic matcher, then uses the topic graph distance to update `engagement` and count repeated off-topic turns. If reranking is unavailable or returns no valid scores, ConsistentMI falls back to lexical matching.
 4. **Verify Motivation (Optional)**: If the therapist addresses the client’s core motivation, the client enters a short `Motivation` state for an acknowledging response.
 5. **Sample a Stage-Consistent Action**: An LLM predicts an action distribution conditioned on recent context and the current stage.
 6. **Select Grounding Detail**: For actions like `Inform/Downplay/Blame/Hesitate/Plan`, the client selects a relevant persona/belief/plan (only when the therapist asks a question) to ground the next reply.
@@ -58,16 +58,43 @@ response = client.generate_response(
 print(response)
 ```
 
+> ⚠️ **Hint:**
+>
+> - ConsistentMI uses a local reranker served through vLLM's OpenAI-compatible `/rerank` endpoint.
+> - Set `LOCAL_BASE_URL` and `LOCAL_API_KEY` in `.env`; PatientHub reuses them for the reranker.
+> - Use `reranker_model_type=LOCAL`.
+> - Set `reranker_model_name` to the LiteLLM vLLM route, e.g. `hosted_vllm/BAAI/bge-reranker-v2-m3`.
+> - If the reranker server runs on the same machine, prefer `127.0.0.1` over `0.0.0.0` in `LOCAL_BASE_URL`.
+
 ## Configuration
 
-| Option             | Description                      | Default                                        |
-| ------------------ | -------------------------------- | ---------------------------------------------- |
-| `prompt_path`      | Path to prompt file              | `data/prompts/client/consistentMI.yaml`        |
-| `data_path`        | Path to character file           | `data/characters/ConsistentMI.json`            |
-| `data_idx`         | Character index                  | `0`                                            |
-| `topics_path`      | Topics from Wiki                 | `data/resources/ConsistentMI/topics.json`      |
-| `topic_graph_path` | Correlation between topics       | `data/resources/ConsistentMI/topic_graph.json` |
-| `model_retriever`  | retrieve the most relevant topic | None                                           |
+| Option                | Description                                             | Default                                        |
+| --------------------- | ------------------------------------------------------- | ---------------------------------------------- |
+| `prompt_path`         | Path to prompt file                                     | `data/prompts/client/consistentMI.yaml`        |
+| `data_path`           | Path to character file                                  | `data/characters/ConsistentMI.json`            |
+| `data_idx`            | Character index                                         | `0`                                            |
+| `topics_path`         | Topics from Wiki                                        | `data/resources/ConsistentMI/topics.json`      |
+| `topic_graph_path`    | Correlation between topics                              | `data/resources/ConsistentMI/topic_graph.json` |
+| `reranker_model_type` | Provider key for topic reranking                        | `LOCAL`                                        |
+| `reranker_model_name` | LiteLLM model route for the reranker                    | `hosted_vllm/BAAI/bge-reranker-v2-m3`          |
+
+### Local Reranker Example
+
+```yaml
+client:
+  agent_name: consistentMI
+  model_type: OPENAI
+  model_name: gpt-4o
+  reranker_model_type: LOCAL
+  reranker_model_name: hosted_vllm/BAAI/bge-reranker-v2-m3
+```
+
+With a local vLLM reranker server, your `.env` should contain:
+
+```bash
+LOCAL_BASE_URL=http://127.0.0.1:7891/v1
+LOCAL_API_KEY=EMPTY
+```
 
 ## Character Data Format
 
diff --git a/docs/docs/getting-started/configuration.md b/docs/docs/getting-started/configuration.md
@@ -22,11 +22,13 @@ For example,
 OPENAI_API_KEY=your_openai_key
 OPENAI_BASE_URL=https://api.openai.com
 
-# For VLLM (n this case, model_type = VLLM)
-VLLM_BASE_URL=http://127.0.0.1
-VLLM_API_KEY=None
+# For local OpenAI-compatible servers (model_type = LOCAL)
+LOCAL_BASE_URL=http://127.0.0.1:8000/v1
+LOCAL_API_KEY=EMPTY
 ```
 
+`model_type` is used to select the environment-variable namespace. For example, `model_type=LOCAL` makes PatientHub read `LOCAL_BASE_URL` and `LOCAL_API_KEY`.
+
 ## Model Configuration
 
 ### Using OpenAI (Default)
@@ -65,9 +67,14 @@ config = {
 ```yaml
 client:
   agent_name: consistentMI
-  initial_stage: precontemplation # precontemplation, contemplation, preparation, action
+  model_type: OPENAI
+  model_name: gpt-4o
+  reranker_model_type: LOCAL
+  reranker_model_name: hosted_vllm/BAAI/bge-reranker-v2-m3
 ```
 
+`ConsistentMI` uses the main `model_type` / `model_name` pair for response generation and a separate `reranker_model_type` / `reranker_model_name` pair for topic matching. The reranker currently reuses `LOCAL_BASE_URL` and `LOCAL_API_KEY`.
+
 #### SimPatient
 
 ```yaml
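
Note on the configuration.md change above: the docs say `model_type` selects the environment-variable namespace, so `model_type=LOCAL` reads `LOCAL_BASE_URL` and `LOCAL_API_KEY`. A minimal sketch of that naming convention follows. `resolve_endpoint` is a hypothetical helper rather than a PatientHub function, and upper-casing the key is an assumption.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_endpoint(
    model_type: str, env: Optional[Mapping[str, str]] = None
) -> Tuple[Optional[str], Optional[str]]:
    """Return (base_url, api_key) for a provider key such as "LOCAL".

    Hypothetical helper: it only illustrates the <MODEL_TYPE>_BASE_URL /
    <MODEL_TYPE>_API_KEY naming convention described in the docs.
    """
    env = os.environ if env is None else env
    prefix = model_type.upper()  # assumption: provider keys are upper-case
    return env.get(f"{prefix}_BASE_URL"), env.get(f"{prefix}_API_KEY")


# Example values copied from the .env snippet in the docs.
example_env = {
    "LOCAL_BASE_URL": "http://127.0.0.1:7891/v1",
    "LOCAL_API_KEY": "EMPTY",
}
base_url, api_key = resolve_endpoint("LOCAL", example_env)
print(base_url, api_key)
```

Because the reranker reuses the `LOCAL_*` pair, a setup that also serves a local chat model under `model_type=LOCAL` would point both at the same base URL under this convention.
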
diff --git a/docs/docs/getting-started/installation.md b/docs/docs/getting-started/installation.md
@@ -75,6 +75,44 @@ LOCAL_API_KEY=EMPTY
 
 Then set your config to use `model_type=LOCAL` and `model_name` to the model name exposed by your vLLM server.
 
+### Local Reranker Models via vLLM
+
+`ConsistentMI` can also use a local reranker served by vLLM's OpenAI-compatible `/rerank` endpoint.
+
+1) Start a reranker model with vLLM:
+
+```bash
+vllm serve BAAI/bge-reranker-v2-m3 --host 0.0.0.0 --port 7891
+```
+
+2) Point `LOCAL_BASE_URL` at the reranker server:
+
+```bash
+LOCAL_BASE_URL=http://127.0.0.1:7891/v1
+LOCAL_API_KEY=EMPTY
+```
+
+3) Use the LiteLLM vLLM route in `ConsistentMI`:
+
+```yaml
+client:
+  agent_name: consistentMI
+  reranker_model_type: LOCAL
+  reranker_model_name: hosted_vllm/BAAI/bge-reranker-v2-m3
+```
+
+:::tip Localhost vs 0.0.0.0
+Use `0.0.0.0` for the server listen address, but use `127.0.0.1` or the machine's real IP in `LOCAL_BASE_URL`.
+:::
+
+:::tip Proxy settings
+If your shell exports `http_proxy` or `https_proxy`, local requests to the reranker can be sent to the proxy instead of your vLLM server. For local testing, either unset those variables or set:
+
+```bash
+export NO_PROXY=127.0.0.1,localhost
+```
+:::
+
 :::note vLLM fails to start
 it’s usually a CUDA/driver mismatch on the serving machine—check your NVIDIA driver/CUDA runtime and use a vLLM version compatible with your environment.
 :::
diff --git a/patienthub/clients/consistentMI.py b/patienthub/clients/consistentMI.py
@@ -37,7 +37,9 @@ class ConsistentMIClientConfig(APIModelConfig):
     prompt_path: str = "data/prompts/client/consistentMI.yaml"
     data_path: str = "data/characters/ConsistentMI.json"
     topics_path: str = "data/resources/ConsistentMI/topics.json"
-    topic_graph_path: str = "data/resources/ConsistentMI/topic_graph.json"
+    topic_graph_path: str = "data/resources/ConsistentMI/topic_graph.json"    
+    reranker_model_type: str = "LOCAL"
+    reranker_model_name: str = "hosted_vllm/BAAI/bge-reranker-v2-m3"
     data_idx: int = 0
 
 
@@ -186,7 +188,7 @@ class TopicMatcher:
     def __init__(self, configs: Dict[str, Any]):
         self.topic_graph = load_json(configs.topic_graph_path)
         self.reranker = (
-            get_reranker(configs.model_retriever) if configs.model_retriever else None
+            get_reranker(configs)
         )
         self.all_topics = self.extract_all_topics()
         self.topic_passages: List[str] = []
@@ -230,6 +232,7 @@ def find_related_topics(self, query: str, top_k: int = 5) -> List[str]:
         top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[
             :top_k
         ]
+        print(f"Related topics: {[self.all_topics[i] for i in top_indices]}")
         return [self.all_topics[i] for i in top_indices]
 
     def score_passages(self, query: str) -> Optional[List[float]]:
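
Note on the `TopicMatcher` change above: the updated docs state that ConsistentMI falls back to lexical matching when reranking is unavailable or returns no valid scores. The fallback itself is not part of this diff, so the sketch below is only one way the behaviour could look. `lexical_scores` (plain token overlap) and the `scorer` callable are illustrative assumptions, not the project's implementation.

```python
from typing import Callable, List, Optional

# A scorer returns one score per topic, or None when reranking failed.
Scorer = Callable[[str, List[str]], Optional[List[float]]]


def lexical_scores(query: str, topics: List[str]) -> List[float]:
    """Assumed fallback: count whitespace tokens shared by query and topic."""
    query_tokens = set(query.lower().split())
    return [float(len(query_tokens & set(t.lower().split()))) for t in topics]


def find_related_topics(
    query: str, topics: List[str], scorer: Optional[Scorer] = None, top_k: int = 5
) -> List[str]:
    """Rank topics by reranker score, using lexical overlap if scoring fails."""
    scores = scorer(query, topics) if scorer is not None else None
    if not scores or len(scores) != len(topics):
        scores = lexical_scores(query, topics)
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[
        :top_k
    ]
    return [topics[i] for i in top_indices]


topics = ["sleep quality", "alcohol use", "family relationships"]

# Reranker unavailable: the scorer returns None, so token overlap decides.
print(find_related_topics("has alcohol affected you", topics, lambda q, p: None, top_k=1))

# Reranker available: its scores decide the order.
print(find_related_topics("anything", topics, lambda q, p: [0.9, 0.1, 0.2], top_k=2))
```
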
diff --git a/patienthub/utils/models.py b/patienthub/utils/models.py
@@ -4,7 +4,7 @@
 from dotenv import load_dotenv
 from dataclasses import dataclass
 from typing import Any, List, Optional, Dict
-from litellm import completion, supports_response_schema, completion_cost
+from litellm import completion, supports_response_schema, completion_cost, rerank
 
 logging.getLogger("LiteLLM").setLevel(logging.WARNING)
 
@@ -91,82 +91,80 @@ def get(name, default=None):
 
 @dataclass
 class Reranker:
-    def __init__(self, tokenizer: Any, model: Any, device: Any):
-        self.tokenizer = tokenizer
-        self.model = model
-        self.device = device
-
-    def score(
-        self, query: str, passages: List[str], max_length: int = 512
-    ) -> Optional[List[float]]:
-        """Score (query, passage) pairs. Higher = more relevant."""
-        if not passages:
+    """Reranker backed by LiteLLM's hosted_vllm provider."""
+
+    model_name: str
+    api_base: Optional[str] = None
+    api_key: Optional[str] = None
+
+    @staticmethod
+    def read_field(obj: Any, name: str, default: Any = None) -> Any:
+        if isinstance(obj, dict):
+            return obj.get(name, default)
+        return getattr(obj, name, default)
+
+    @classmethod
+    def extract_scores(cls, response: Any, total_docs: int) -> Optional[List[float]]:
+        scores = [0.0] * total_docs
+        results = cls.read_field(response, "results", []) or []
+        valid_count = 0
+
+        for item in results:
+            index = cls.read_field(item, "index")
+            relevance_score = cls.read_field(item, "relevance_score")
+            if relevance_score is None:
+                relevance_score = cls.read_field(item, "score")
+
+            try:
+                index = int(index)
+                relevance_score = float(relevance_score)
+            except (TypeError, ValueError):
+                continue
+
+            if 0 <= index < total_docs:
+                scores[index] = relevance_score
+                valid_count += 1
+
+        if valid_count == 0:
             return None
 
-        pairs = [(query, passage) for passage in passages]
+        return scores
 
-        try:
-            return self.compute_scores(pairs, max_length)
-        except Exception:
+    def score(self, query: str, passages: List[str]) -> Optional[List[float]]:
+        """Score passages through LiteLLM's rerank endpoint."""
+        if not passages:
             return None
 
-    def compute_scores(self, pairs: List[tuple], max_length: int) -> List[float]:
-        """Compute relevance scores for query-passage pairs."""
-        import torch
-
-        with torch.no_grad():
-            inputs = self.tokenizer(
-                pairs,
-                padding=True,
-                truncation=True,
-                return_tensors="pt",
-                max_length=max_length,
+        try:
+            response = rerank(
+                model=self.model_name,
+                query=query,
+                documents=passages,
+                top_n=len(passages),
+                return_documents=False,
+                api_base=self.api_base,
+                api_key=self.api_key,
             )
-            inputs = {k: v.to(self.device) for k, v in inputs.items()}
-            outputs = self.model(**inputs, return_dict=True)
-            logits = outputs.logits.view(-1).float()
-            return torch.sigmoid(logits).tolist()
-
-
-def get_device(device_index: int):
-    import torch
-
-    try:
-        device_index = int(device_index)
-    except Exception:
-        device_index = 0
-
-    if torch.cuda.is_available() and device_index >= 0:
-        return torch.device(f"cuda:{device_index}")
-    return torch.device("cpu")
-
-
-def load_reranker_model(model_name: str, device: Any):
-    """Load tokenizer and model for reranking."""
-    from transformers import AutoModelForSequenceClassification, AutoTokenizer
+        except Exception:
+            return None
 
-    tokenizer = AutoTokenizer.from_pretrained(model_name)
-    model = AutoModelForSequenceClassification.from_pretrained(model_name)
-    model.to(device)
-    model.eval()
-    return tokenizer, model
+        return self.extract_scores(response, len(passages))
 
 
 def get_reranker(configs: Any) -> Optional[Reranker]:
-    """Get a Reranker instance from config, or None if unavailable."""
+    """Get a LOCAL reranker backed by LiteLLM's hosted_vllm provider."""
 
     def get(name, default=None):
         return get_config_value(configs, name, default)
 
-    model_type = get("model_type")
-    model_name = get("model_name")
+    model_type = get("reranker_model_type")
+    model_name = get("reranker_model_name")
 
-    if model_type not in ("huggingface", "local") or not model_name:
+    if model_type != "LOCAL" or not model_name:
         return None
 
-    try:
-        device = get_device(get("device", 0))
-        tokenizer, model = load_reranker_model(model_name, device)
-        return Reranker(tokenizer=tokenizer, model=model, device=device)
-    except Exception:
-        return None
+    return Reranker(
+        model_name=model_name,
+        api_base=os.environ.get("LOCAL_BASE_URL"),
+        api_key=os.environ.get("LOCAL_API_KEY"),
+    )
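
Note on `Reranker.extract_scores` above: a rerank response lists results by document index, possibly out of order or incomplete, while the caller needs one score per passage in input order. The standalone restatement below mirrors the logic in the diff so the behaviour can be checked without a server. The sample `response` dictionary is made up for illustration and is not a captured vLLM or LiteLLM payload.

```python
from typing import Any, List, Optional


def read_field(obj: Any, name: str, default: Any = None) -> Any:
    """Read a field from either a dict or an attribute-style object."""
    if isinstance(obj, dict):
        return obj.get(name, default)
    return getattr(obj, name, default)


def extract_scores(response: Any, total_docs: int) -> Optional[List[float]]:
    """Map rerank results back onto a dense, input-ordered score list."""
    scores = [0.0] * total_docs
    valid_count = 0
    for item in read_field(response, "results", []) or []:
        score = read_field(item, "relevance_score")
        if score is None:
            score = read_field(item, "score")
        try:
            index = int(read_field(item, "index"))
            score = float(score)
        except (TypeError, ValueError):
            continue  # skip entries with a missing or malformed index/score
        if 0 <= index < total_docs:
            scores[index] = score
            valid_count += 1
    # No usable entry at all is reported as None so the caller can fall back.
    return scores if valid_count else None


# Illustrative response: out of order, one bad index, one missing score.
response = {
    "results": [
        {"index": 2, "relevance_score": 0.91},
        {"index": 0, "score": 0.15},
        {"index": 7, "relevance_score": 0.5},
        {"index": 1},
    ]
}
print(extract_scores(response, 3))  # [0.15, 0.0, 0.91]
print(extract_scores({"results": []}, 3))  # None
```

Passages without a valid result keep the default score `0.0`, so they sort last rather than being dropped.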