Standardize on the reward functions (#86)

phillipleblanc · web-flow · commit c74ab804d4b6 · 2021-12-08T22:42:26.000+09:00
* prev_state -> current_state; new_state -> next_state

* Docs changes
diff --git a/spiceaidocs/content/en/concepts/interpretations/_index.md b/spiceaidocs/content/en/concepts/interpretations/_index.md
@@ -30,6 +30,6 @@ The interpretation is defined as a time range from `start` to `end`, with a `nam
 
 Interpretations can be used to provide hints to the reward function on how to reward a time step. In the above example, when the training reaches Tuesday, the reward function author might choose to reward buys even higher based on that expert input.
 
-When the action specific reward function is called, if there is an interpretation in that time range, it will be provided to the reward function in `[state].interpretations`. E.g. if an interpretation overlapped with new state then `new_state.interpretations` would contain a list of the overlapping interpretations.
+When the action-specific reward function is called, any interpretation that overlaps the time range of a state is provided to the reward function in `[state]_interpretations`. For example, if an interpretation overlaps the next state, `next_state_interpretations` contains a list of the overlapping interpretations.
 
 Comparing Spice.ai recommendations to interpretations is also one way of testing Spice.ai recommendations against expected actions for input data.
diff --git a/spiceaidocs/content/en/concepts/rewards/_index.md b/spiceaidocs/content/en/concepts/rewards/_index.md
@@ -22,10 +22,10 @@ The reward function must assign a value to `reward` for it to be valid.
 
 The following variables are available to be used in the reward function:
 
-| variable   | Type                                                                                  | Description                                                    |
-| ---------- | ------------------------------------------------------------------------------------- | -------------------------------------------------------------- |
-| prev_state | [SimpleNamespace](https://docs.python.org/3/library/types.html#types.SimpleNamespace) | The observation state when the action was taken                |
-| new_state  | [SimpleNamespace](https://docs.python.org/3/library/types.html#types.SimpleNamespace) | The observation state from directly after the action was taken |
+| variable      | Type                                                                   | Description                                                           |
+| ------------- | ---------------------------------------------------------------------- | --------------------------------------------------------------------- |
+| current_state | [dict](https://docs.python.org/3.8/library/stdtypes.html#typesmapping) | The observation state when the action was taken                       |
+| next_state    | [dict](https://docs.python.org/3.8/library/stdtypes.html#typesmapping) | The observation state one granularity step after the action was taken |
 
 ### Example
 
@@ -37,36 +37,36 @@ training:
     - reward: close_valve
       # Reward keeping moisture content above 25%
       with: |
-        if new_state.sensors_garden_moisture > 0.25:
+        if next_state["sensors_garden_moisture"] > 0.25:
           reward = 200
 
         # Penalize low moisture content depending on how far the garden has dried out
         else:
-          reward = -100 * (0.25 - new_state.sensors_garden_moisture)
+          reward = -100 * (0.25 - next_state["sensors_garden_moisture"])
 
-          # Penalize especially heavily if the drying trend is continuing (new_state is drier than prev_state)
-          if new_state.sensors_garden_moisture < prev_state.sensors_garden_moisture:
+          # Penalize especially heavily if the drying trend is continuing (next_state is drier than current_state)
+          if next_state["sensors_garden_moisture"] < current_state["sensors_garden_moisture"]:
             reward = reward * 2
 
     - reward: open_valve_half
       # Reward watering when needed, more heavily if the garden is more dried out
       with: |
-        if new_state.sensors_garden_moisture < 0.25:
-          reward = 100 * (0.25 - new_state.sensors_garden_moisture)
+        if next_state["sensors_garden_moisture"] < 0.25:
+          reward = 100 * (0.25 - next_state["sensors_garden_moisture"])
 
         # Penalize wasting water
         # Penalize overwatering depending on how overwatered the garden is
         else:
-          reward = -50 * (new_state.sensors_garden_moisture - 0.25)
+          reward = -50 * (next_state["sensors_garden_moisture"] - 0.25)
 
     - reward: open_valve_full
       # Reward watering when needed, more heavily if the garden is more dried out
       with: |
-        if new_state.sensors_garden_moisture < 0.25:
-          reward = 200 * (0.25 - new_state.sensors_garden_moisture)
+        if next_state["sensors_garden_moisture"] < 0.25:
+          reward = 200 * (0.25 - next_state["sensors_garden_moisture"])
 
         # Penalize wasting water more heavily with valve fully open
         # Penalize overwatering depending on how overwatered the garden is
         else:
-          reward = -100 * (new_state.sensors_garden_moisture - 0.25)
+          reward = -100 * (next_state["sensors_garden_moisture"] - 0.25)
 ```
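Reviewer note: to make the switch from attribute access to dict access concrete, the `close_valve` block above can be sketched as a plain Python function. This is only an illustration under stated assumptions: the function name `close_valve_reward` is hypothetical, and returning the value (rather than assigning to `reward`) is a convenience for running it outside the Spice.ai runtime.

```python
def close_valve_reward(current_state: dict, next_state: dict) -> float:
    """Hypothetical standalone version of the close_valve reward block."""
    # States are plain dicts keyed by the fully qualified field name
    moisture_now = current_state["sensors_garden_moisture"]
    moisture_next = next_state["sensors_garden_moisture"]

    # Reward keeping moisture content above 25%
    if moisture_next > 0.25:
        return 200.0

    # Penalize low moisture depending on how far the garden has dried out
    reward = -100 * (0.25 - moisture_next)

    # Penalize twice as heavily if the drying trend is continuing
    if moisture_next < moisture_now:
        reward *= 2
    return reward
```

Because the states are ordinary dicts, a misspelled field name raises a `KeyError` at lookup time, which may be easier to diagnose than a missing attribute on a namespace object.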
diff --git a/spiceaidocs/content/en/reference/pod/_index.md b/spiceaidocs/content/en/reference/pod/_index.md
@@ -122,7 +122,7 @@ Pod time, time-series and time-data related configuration is defined in the `tim
 
 A list of time categories, such as `month` or `weekday`, enabling the automatic creation of fields from the observation `time`. For example, by specifying `month` the Spice.ai engine automatically creates a field in the data called `time_month_<month>` with a value calculated from the month to which that timestamp relates. This enables learning from cyclical patterns, such as monthly or daily cycles.
 
-***Example***
+**_Example_**
 
 ```yaml
 time:
@@ -758,17 +758,15 @@ training:
 
 A python code block that will be run before an action specific reward code block runs. Use this to define common variables that will be useful to reference in the specific reward code blocks.
 
-Access observation state variables by specifying their fully qualified names and prefixing with `prev_state.` for the value at the previous state before the action was taken, and `new_state.` for the value of the state right after the action was taken.
-
 **Example**
 
 ```yaml
 training:
   reward_init: |
     # Compute price change between previous state and this one 
     # so it can be used in all three reward functions
-    prev_price = prev_state.coinbase.btcusd.close
-    new_price = new_state.coinbase.btcusd.close
+    prev_price = current_state["coinbase_btcusd_close"]
+    new_price = next_state["coinbase_btcusd_close"]
     change_in_price = new_price - prev_price
   rewards:
     - reward: buy
@@ -784,6 +782,10 @@ training:
           reward = 0.1
 ```
 
+### `training.reward_funcs`
+
+The path to a Python file that defines the reward functions to use in place of inline Python code blocks.
+
 ### `training.rewards`
 
 **Required**. Defines how to reward the Spice.ai runtime during training so that it learns to take more intelligent actions.
@@ -822,18 +824,8 @@ training:
 
 ### `training.rewards[*].with`
 
-A python code block that needs to assign a variable to `reward` to specify which reward to give the Spice.ai agent for taking this action.
+If `training.reward_funcs` is defined, this is the name of the function in that Python file that determines the reward given to the Spice.ai agent for taking this action.
 
-Access observation state variables by specifying their fully qualified names and prefixing with `prev_state.` for the value at the previous state before the action was taken, and `new_state.` for the value of the state right after the action was taken.
+If `training.reward_funcs` is not defined, this is a Python code block that must assign a value to `reward`, which specifies the reward given to the Spice.ai agent for taking this action.
 
-```yaml
-training:
-  rewards:
-    - reward: jump
-      with: |
-        # If we weren't able to jump, penalize trying to jump
-        if new_state.game.character.height > prev_state.game.character.height:
-          reward = 1
-        else:
-          reward = -1
-```
+See [Rewards]({{<ref "concepts/rewards">}}) for more information on how to define reward functions.
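Reviewer note: the docs for `training.reward_funcs` describe a Python file of named reward functions but do not show one. A minimal sketch of what such a file might look like follows. The file name `rewards.py`, the helper `price_change`, the four-parameter signature, and the reward rules themselves are all assumptions made for illustration; only the state variable names and the `coinbase_btcusd_close` key come from this change.

```python
# rewards.py (hypothetical file that training.reward_funcs would point to)

def price_change(current_state: dict, next_state: dict) -> float:
    """Hypothetical helper: close-price change between the two states."""
    return next_state["coinbase_btcusd_close"] - current_state["coinbase_btcusd_close"]


# Assumed signature: the parameter list and its order are not confirmed here.
def buy(current_state, current_state_interpretations,
        next_state, next_state_interpretations) -> float:
    # Illustrative rule: buying is rewarded when the price rises afterwards
    return price_change(current_state, next_state)


def sell(current_state, current_state_interpretations,
         next_state, next_state_interpretations) -> float:
    # Illustrative rule: selling is rewarded when the price falls afterwards
    return -price_change(current_state, next_state)
```

With a file like this, each `training.rewards[*].with` entry would hold a function name such as `buy` or `sell` instead of an inline code block.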
diff --git a/spiceaidocs/content/en/reference/pod/quickstarts-trader.md b/spiceaidocs/content/en/reference/pod/quickstarts-trader.md
@@ -73,8 +73,8 @@ training:
   # Compute price change between previous state and this one
   # so it can be used in all three reward functions
   reward_init: |
-    prev_price = prev_state.coinbase.btcusd.close
-    new_price = new_state.coinbase_btcusd_close
+    prev_price = current_state["coinbase_btcusd_close"]
+    new_price = next_state["coinbase_btcusd_close"]
     change_in_price = new_price - prev_price
 
   rewards:
diff --git a/spiceaidocs/content/en/reference/pod/samples-gardener.md b/spiceaidocs/content/en/reference/pod/samples-gardener.md
@@ -38,36 +38,36 @@ training:
     - reward: close_valve
       # Reward keeping moisture content above 25%
       with: |
-        if new_state.sensors_garden_moisture > 0.25:
+        if next_state["sensors_garden_moisture"] > 0.25:
           reward = 200
 
         # Penalize low moisture content depending on how far the garden has dried out
         else:
-          reward = -100 * (0.25 - new_state.sensors_garden_moisture)
+          reward = -100 * (0.25 - next_state["sensors_garden_moisture"])
 
-          # Penalize especially heavily if the drying trend is continuing (new_state is drier than prev_state)
-          if new_state.sensors_garden_moisture < prev_state.sensors_garden_moisture:
+          # Penalize especially heavily if the drying trend is continuing (next_state is drier than current_state)
+          if next_state["sensors_garden_moisture"] < current_state["sensors_garden_moisture"]:
             reward = reward * 2
 
     - reward: open_valve_half
       # Reward watering when needed, more heavily if the garden is more dried out
       with: |
-        if new_state.sensors_garden_moisture < 0.25:
-          reward = 100 * (0.25 - new_state.sensors_garden_moisture)
+        if next_state["sensors_garden_moisture"] < 0.25:
+          reward = 100 * (0.25 - next_state["sensors_garden_moisture"])
 
         # Penalize wasting water
         # Penalize overwatering depending on how overwatered the garden is
         else:
-          reward = -50 * (new_state.sensors_garden_moisture - 0.25)
+          reward = -50 * (next_state["sensors_garden_moisture"] - 0.25)
 
     - reward: open_valve_full
       # Reward watering when needed, more heavily if the garden is more dried out
       with: |
-        if new_state.sensors_garden_moisture < 0.25:
-          reward = 200 * (0.25 - new_state.sensors_garden_moisture)
+        if next_state["sensors_garden_moisture"] < 0.25:
+          reward = 200 * (0.25 - next_state["sensors_garden_moisture"])
 
         # Penalize wasting water more heavily with valve fully open
         # Penalize overwatering depending on how overwatered the garden is
         else:
-          reward = -100 * (new_state.sensors_garden_moisture - 0.25)
+          reward = -100 * (next_state["sensors_garden_moisture"] - 0.25)
 ```
diff --git a/spiceaidocs/content/en/reference/pod/samples-serverops.md b/spiceaidocs/content/en/reference/pod/samples-serverops.md
@@ -42,8 +42,8 @@ training:
   reward_init: |
     high_cpu_usage_threshold = 10
 
-    cpu_usage_new = 100 - new_state.hostmetrics_cpu_usage_idle
-    cpu_usage_prev = 100 - prev_state.hostmetrics_cpu_usage_idle
+    cpu_usage_new = 100 - next_state["hostmetrics_cpu_usage_idle"]
+    cpu_usage_prev = 100 - current_state["hostmetrics_cpu_usage_idle"]
     cpu_usage_delta = cpu_usage_new - cpu_usage_prev
 
     cpu_usage_delta_abs = cpu_usage_delta