
Commit 7131166

gehringericl authored and committed
[rllib] Tracing for eager tensorflow policies with tf.function (#5705)
* Added tracing of eager policies with `tf.function`
* lint
* add config option
* add docs
* wip
* tracing now works with a3c
* typo
* none
* file doc
* returns
* syntax error
* syntax error
1 parent d1e4b36 commit 7131166

File tree

11 files changed: 204 additions, 37 deletions


doc/source/rllib-concepts.rst

Lines changed: 2 additions & 2 deletions

@@ -418,9 +418,9 @@ Finally, note that you do not have to use ``build_tf_policy`` to define a Tensor
 Building Policies in TensorFlow Eager
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Policies built with ``build_tf_policy`` (most of the reference algorithms are) can be run in eager mode by setting the ``"eager": True`` config option or using ``rllib train --eager``. This will tell RLlib to execute the model forward pass, action distribution, loss, and stats functions in eager mode.
+Policies built with ``build_tf_policy`` (most of the reference algorithms are) can be run in eager mode by setting the ``"eager": True`` / ``"eager_tracing": True`` config options or using ``rllib train --eager [--trace]``. This will tell RLlib to execute the model forward pass, action distribution, loss, and stats functions in eager mode.
 
-Eager mode makes debugging much easier, since you can now use normal Python functions such as ``print()`` to inspect intermediate tensor values. However, it is slower than graph mode.
+Eager mode makes debugging much easier, since you can now use normal Python functions such as ``print()`` to inspect intermediate tensor values. However, it can be slower than graph mode unless tracing is enabled.
 
 You can also selectively leverage eager operations within graph mode execution with `tf.py_function <https://www.tensorflow.org/api_docs/python/tf/py_function>`__. Here's an example of using eager ops embedded `within a loss function <https://github.com/ray-project/ray/blob/master/rllib/examples/eager_execution.py>`__.
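For context, a minimal sketch of what enabling these options looks like from Python (assuming a standard ``ray``/``tune`` install and the ``CartPole-v0`` Gym environment; only the ``"eager"`` and ``"eager_tracing"`` keys come from this commit, the rest is ordinary Tune usage):

    import ray
    from ray import tune

    ray.init()

    # Train A3C with eager execution plus tf.function tracing enabled.
    tune.run(
        "A3C",
        stop={"training_iteration": 5},
        config={
            "env": "CartPole-v0",
            "eager": True,          # run TF policy ops eagerly
            "eager_tracing": True,  # wrap the eager pass in tf.function
        })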

doc/source/rllib-training.rst

Lines changed: 3 additions & 3 deletions

@@ -14,7 +14,7 @@ You can train a simple DQN trainer with the following command:
 
 .. code-block:: bash
 
-    rllib train --run DQN --env CartPole-v0  # add --eager for eager execution
+    rllib train --run DQN --env CartPole-v0  # --eager [--trace] for eager execution
 
 By default, the results will be logged to a subdirectory of ``~/ray_results``.
 This subdirectory will contain a file ``params.json`` which contains the

@@ -544,9 +544,9 @@ The ``"monitor": true`` config can be used to save Gym episode videos to the res
 Eager Mode
 ~~~~~~~~~~
 
-Policies built with ``build_tf_policy`` can be also run in eager mode by setting the ``"eager": True`` config option or using ``rllib train --eager``. This will tell RLlib to execute the model forward pass, action distribution, loss, and stats functions in eager mode.
+Policies built with ``build_tf_policy`` (most of the reference algorithms are) can be run in eager mode by setting the ``"eager": True`` / ``"eager_tracing": True`` config options or using ``rllib train --eager [--trace]``. This will tell RLlib to execute the model forward pass, action distribution, loss, and stats functions in eager mode.
 
-Eager mode makes debugging much easier, since you can now use normal Python functions such as ``print()`` to inspect intermediate tensor values. However, it is slower than graph mode.
+Eager mode makes debugging much easier, since you can now use normal Python functions such as ``print()`` to inspect intermediate tensor values. However, it can be slower than graph mode unless tracing is enabled.
 
 Episode Traces
 ~~~~~~~~~~~~~~
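To make the debugging claim concrete, here is a hypothetical loss function (not part of this commit) of the kind passed to ``build_tf_policy``, with an interface mirroring the ``custom_tf_policy.py`` example touched below. Under ``"eager": True`` the plain ``print()`` shows real tensor values on every update; in graph mode it would only print a symbolic tensor once while the graph is built:

    import tensorflow as tf

    # Hypothetical loss_fn for build_tf_policy() (illustrative only).
    def debug_pg_loss(policy, model, dist_class, train_batch):
        logits, _ = model.from_batch(train_batch)
        action_dist = dist_class(logits, model)
        logp = action_dist.logp(train_batch["actions"])
        # Eager mode: prints concrete values on each call.
        # Graph mode: prints a symbolic Tensor once at graph-construction time.
        print("logp (first 5):", logp[:5])
        return -tf.reduce_mean(logp * train_batch["returns"])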

doc/source/rllib.rst

Lines changed: 1 addition & 1 deletion

@@ -25,7 +25,7 @@ Then, you can try out training in the following equivalent ways:
 
 .. code-block:: bash
 
-    rllib train --run=PPO --env=CartPole-v0  # add --eager for eager execution
+    rllib train --run=PPO --env=CartPole-v0  # --eager [--trace] for eager execution
 
 .. code-block:: python

rllib/agents/trainer.py

Lines changed: 7 additions & 2 deletions

@@ -70,8 +70,12 @@
     "ignore_worker_failures": False,
     # Log system resource metrics to results.
     "log_sys_usage": True,
-    # Enable TF eager execution (TF policies only)
+    # Enable TF eager execution (TF policies only).
     "eager": False,
+    # Enable tracing in eager mode. This greatly improves performance, but
+    # makes it slightly harder to debug since Python code won't be evaluated
+    # after the initial eager pass.
+    "eager_tracing": False,
     # Disable eager execution on workers (but allow it on the driver). This
     # only has an effect is eager is enabled.
     "no_eager_on_workers": False,

@@ -333,7 +337,8 @@ def __init__(self, config=None, env=None, logger_creator=None):
 
         if tf and config.get("eager"):
             tf.enable_eager_execution()
-            logger.info("Executing eagerly")
+            logger.info("Executing eagerly, with eager_tracing={}".format(
+                "True" if config.get("eager_tracing") else "False"))
 
         if tf and not tf.executing_eagerly():
             logger.info("Tip: set 'eager': true or the --eager flag to enable "

rllib/evaluation/rollout_worker.py

Lines changed: 2 additions & 0 deletions

@@ -752,6 +752,8 @@ def _build_policy_map(self, policy_dict, policy_config):
             if tf and tf.executing_eagerly():
                 if hasattr(cls, "as_eager"):
                     cls = cls.as_eager()
+                    if policy_config["eager_tracing"]:
+                        cls = cls.with_tracing()
                 elif not issubclass(cls, TFPolicy):
                     pass  # could be some other type of policy
                 else:
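``with_tracing()`` itself is defined on the eager policy class (in ``rllib/policy/eager_tf_policy.py``, not shown in this diff). As a rough, illustrative sketch of the pattern, not RLlib's actual implementation, a class method can return a subclass whose forward pass is wrapped in ``tf.function``:

    import tensorflow as tf

    class EagerToyPolicy:
        """Toy stand-in for an eager policy class (illustrative only)."""

        @classmethod
        def with_tracing(cls):
            class Traced(cls):
                def __init__(self, *args, **kwargs):
                    super().__init__(*args, **kwargs)
                    # Wrap the eager forward pass once; later calls reuse the
                    # traced graph instead of re-running the Python code.
                    self.forward = tf.function(self.forward)

            Traced.__name__ = cls.__name__ + "_traced"
            return Traced

        def forward(self, obs):
            return tf.reduce_sum(obs, axis=-1)  # stand-in for a model forward pass

    policy = EagerToyPolicy.with_tracing()()
    print(policy.forward(tf.ones([4, 3])))  # executes via the tf.function wrapper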

rllib/examples/custom_tf_policy.py

Lines changed: 2 additions & 2 deletions

@@ -21,14 +21,14 @@ def policy_gradient_loss(policy, model, dist_class, train_batch):
     logits, _ = model.from_batch(train_batch)
     action_dist = dist_class(logits, model)
     return -tf.reduce_mean(
-        action_dist.logp(train_batch["actions"]) * train_batch["advantages"])
+        action_dist.logp(train_batch["actions"]) * train_batch["returns"])
 
 
 def calculate_advantages(policy,
                          sample_batch,
                          other_agent_batches=None,
                          episode=None):
-    sample_batch["advantages"] = discount(sample_batch["rewards"], 0.99)
+    sample_batch["returns"] = discount(sample_batch["rewards"], 0.99)
     return sample_batch
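The ``discount`` helper used in ``calculate_advantages`` is imported elsewhere in the example file; as a reference, a minimal equivalent that computes the discounted return-to-go for each timestep could look like this:

    import numpy as np

    def discount(rewards, gamma):
        """out[t] = sum over k >= t of gamma**(k - t) * rewards[k]."""
        out = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            out[t] = running
        return out

    print(discount([1.0, 1.0, 1.0], 0.99))  # -> approximately [2.9701, 1.99, 1.0]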

rllib/policy/dynamic_tf_policy.py

Lines changed: 1 addition & 0 deletions

@@ -1,6 +1,7 @@
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+"""Graph mode TF policy built using build_tf_policy()."""
 
 from collections import OrderedDict
 import logging
