docs/evaluation.md (+3 −1: 3 additions, 1 deletion)
@@ -93,7 +93,7 @@ python eval.py \
|**client.is_chat_model**| Indicates if the model follows a chat-based interface. |`True`|
|**client.generate_kwargs.temperature**| Temperature for model response randomness. |`0.0`|
|**client.alternate_roles**| If True, the instruction prompt is fused with the first observation. Required by some LLMs. |`False`|
-|**client.temperature**| If set to null will default to the API default temperature. Use a float from 0.0 to 1.0. otherwise.|`null`|
+|**client.temperature**| If set to `null`, defaults to the API's default temperature; otherwise use a float from 0.0 to 1.0. |`null`|
|**envs.names**| Dash-separated list of environments to evaluate, e.g., `nle-minihack`. |`babyai-babaisai-textworld-crafter-nle-minihack`|
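For reference, a full invocation combining these options might look like the sketch below. This is illustrative only: it assumes Hydra-style `key=value` overrides (consistent with the `client.alternate_roles=True` syntax shown in the notes further down), and the chosen values are arbitrary examples.

```bash
# Illustrative sketch, assuming Hydra-style overrides; values are examples.
python eval.py \
  client.is_chat_model=True \
  client.temperature=0.7 \
  client.alternate_roles=False \
  envs.names=nle-minihack
```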
@@ -104,3 +104,5 @@ python eval.py \
Mac systems might complain about fork when evaluating in multiprocessing mode (`eval.num_workers > 1`). To fix this, export the following before running eval: `export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`. A combined sketch follows these notes.
- Alternate roles:
Some LLMs/VLMs require alternating roles. To comply, you can fuse the instruction prompt with the first observation by setting `client.alternate_roles=True`.
+- Temperature:
+We recommend running models with temperatures around 0.5-0.7, or using the default temperature of the model's API. Temperatures that are too low can cause some of the more brittle models to endlessly repeat actions or produce incoherent outputs.
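As a concrete sketch of the workarounds in the notes above, assuming the same Hydra-style override syntax: the macOS fork-safety export for multiprocessing runs, plus an explicit temperature in the recommended range. The worker count and temperature are arbitrary example values.

```bash
# Sketch only: macOS fork-safety fix plus an explicit temperature override.
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES  # needed when eval.num_workers > 1 on macOS
python eval.py \
  eval.num_workers=4 \
  client.temperature=0.6
```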