
Commit cb136bc

Merge pull request AI-Hypercomputer#2605 from AI-Hypercomputer:hengtaoguo-format
PiperOrigin-RevId: 828583732
Parents: 2dc5ffc + 24fd031

11 files changed (+1147, -1135 lines)

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ repos:
       args:
         - '-w'
         - '--skip="*.txt,pylintrc,.*,src/MaxText/assets/*"'
-        - '-L ND,nd,sems,TE,ROUGE,rouge,astroid'
+        - '-L ND,nd,sems,TE,ROUGE,rouge,astroid,dout'
         - '.'
       additional_dependencies:
         - tomli

docs/explanations/performance_metrics.md

Lines changed: 1 addition & 1 deletion
@@ -101,4 +101,4 @@ This shows any of step time, tokens/s or MFU can be used to determine how long t
 
 ## Why not hardware flops?
 
-Hardware (e.g., XLA reported) FLOPs do not accurately reflect computation efficiency as they depend on the program / implementation, not just on the model and its inherent computations (higher hardware FLOPs does not necessarily mean less room for improvement). For example, they include remat and potentially auxilliary operations (such as reshaping for dropping moe [here](https://github.com/AI-Hypercomputer/maxtext/blob/fafdeaa14183a8f5ca7b9f7b7542ce1655237574/src/MaxText/layers/moe.py#L1544)), which are an implementation detail and not part of the model. In addition, XLA reported FLOPs may not be accurate with pallas kernels. Hardware flops utilization is not (inversely) proportional to step time as opposed to MFU, since hardware flops can change with implementation details like remat policies.
+Hardware (e.g., XLA reported) FLOPs do not accurately reflect computation efficiency as they depend on the program / implementation, not just on the model and its inherent computations (higher hardware FLOPs does not necessarily mean less room for improvement). For example, they include remat and potentially auxiliary operations (such as reshaping for dropping moe [here](https://github.com/AI-Hypercomputer/maxtext/blob/fafdeaa14183a8f5ca7b9f7b7542ce1655237574/src/MaxText/layers/moe.py#L1544)), which are an implementation detail and not part of the model. In addition, XLA reported FLOPs may not be accurate with pallas kernels. Hardware flops utilization is not (inversely) proportional to step time as opposed to MFU, since hardware flops can change with implementation details like remat policies.
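
As a side note on the claim in this hunk that hardware FLOPs utilization does not track step time the way MFU does, here is a minimal numeric sketch. All numbers are invented for illustration (the 459 peak and 764.67 model-TFLOP figures are the ones quoted elsewhere in this commit); it is not measured MaxText data:

# Hypothetical comparison of two remat policies for the same model.
# Model FLOPs depend only on the model; hardware FLOPs also count recompute.
peak_tflops_per_device = 459.0    # TPU v5p peak, as quoted in the docs in this commit
model_tflop_per_device = 764.67   # fixed by the model and batch, not by the program

# Policy A: light remat, Policy B: heavy remat (illustrative numbers only).
policies = {
    "A (light remat)": {"hw_tflop": 800.0, "step_time_s": 6.0},
    "B (heavy remat)": {"hw_tflop": 1000.0, "step_time_s": 6.5},
}

for name, p in policies.items():
  mfu = model_tflop_per_device / p["step_time_s"] / peak_tflops_per_device
  hfu = p["hw_tflop"] / p["step_time_s"] / peak_tflops_per_device
  print(f"{name}: step={p['step_time_s']}s  MFU={mfu:.2%}  HFU={hfu:.2%}")

# Policy B is slower (worse MFU) yet reports a higher hardware-FLOPs
# utilization, which is the point the paragraph above is making.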

docs/guides/understand_logs_and_metrics.md

Lines changed: 2 additions & 2 deletions
@@ -188,7 +188,7 @@ Per train step:
 
 In this example, given `model=deepseek2-16b`, `per_device_batch_size=24`, `max_target_length=2048`, and no gradient accumulation, we have $\text{model tflop per device} \approx 764.67$.
 - 94.54% of the TFLOPs are attributed to learnable weight and 5.46% are attributed to attention.
-- As you will see next, this number is important for calculating performace metrics, such as TFLOP/s/device and Model FLOPs Utilization (MFU).
+- As you will see next, this number is important for calculating performance metrics, such as TFLOP/s/device and Model FLOPs Utilization (MFU).
 
 You can find more information about model FLOPs and MFU in the [Performance Metrics](performance-metrics) topic.
 
@@ -231,7 +231,7 @@ $$\text{tflop/s/device} = \frac{\text{model tflop per device}}{\text{measured st
 
 $$\text{MFU} = \frac{\text{tflop/s/device}}{\text{peak hardware tflop/s}}$$
 
-For TPU v5p, $\text{peak hardware tflop/s}=459$. Thus, $134.924 / 459 = 29.40$%. Note this is an example for explaination with small batch size and sequence length, so the MFU is not optimal.
+For TPU v5p, $\text{peak hardware tflop/s}=459$. Thus, $134.924 / 459 = 29.40$%. Note this is an example for explanation with small batch size and sequence length, so the MFU is not optimal.
 
 **Tokens per second per device (throughput)**
 
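
To tie the two hunks above together, here is a small sketch of the arithmetic as quoted in this file. The measured step time is not shown in this excerpt, so it is back-solved (roughly 5.67 s) from the quoted 134.924 TFLOP/s/device, and the tokens-per-step formula (per_device_batch_size x max_target_length) is an assumption for illustration:

# Numbers quoted in the hunks above; measured_step_time_s is inferred, not from the doc.
model_tflop_per_device = 764.67            # deepseek2-16b, pdb=24, seq=2048
measured_step_time_s = 764.67 / 134.924    # ~5.667 s, back-solved for illustration
peak_hw_tflops_v5p = 459.0

tflops_per_device = model_tflop_per_device / measured_step_time_s   # ~134.924
mfu = tflops_per_device / peak_hw_tflops_v5p                        # ~0.2940

tokens_per_step_per_device = 24 * 2048     # per_device_batch_size * max_target_length
tokens_per_s_per_device = tokens_per_step_per_device / measured_step_time_s

print(f"TFLOP/s/device ~ {tflops_per_device:.3f}, MFU ~ {mfu:.2%}")
print(f"tokens/s/device ~ {tokens_per_s_per_device:.0f}")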

docs/tutorials/grpo_with_pathways.md

Lines changed: 1 addition & 1 deletion
@@ -66,5 +66,5 @@ The overview of the demo script ~/maxtext/src/MaxText/examples/grpo_llama3_1_70b
 
 1. We load a policy model and a reference model. Both are copies of `Llama3.1-70b-Instruct`.
 2. Evaluate the policy model's performance on GSM8K math reasoning benchmark.
-3. Train the policy model using GRPO with potentially different meshes for trainer and rollout dependending on the parameters `TRAINER_DEVICES_FRACTION` and `SAMPLER_DEVICES_FRACTION`. If we set both of these to `1.0`, the entire (same) mesh will be used for both trainer and rollout. If we set say `TRAINER_DEVICES_FRACTION=0.5` and `SAMPLER_DEVICES_FRACTION=0.5`, the first half of the devices will be used for trainer and the second half will be used for rollout
+3. Train the policy model using GRPO with potentially different meshes for trainer and rollout depending on the parameters `TRAINER_DEVICES_FRACTION` and `SAMPLER_DEVICES_FRACTION`. If we set both of these to `1.0`, the entire (same) mesh will be used for both trainer and rollout. If we set say `TRAINER_DEVICES_FRACTION=0.5` and `SAMPLER_DEVICES_FRACTION=0.5`, the first half of the devices will be used for trainer and the second half will be used for rollout
 4. Evaluate the policy model's performance on GSM8K math reasoning benchmark after the post-training with GRPO.
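
For readers of this doc change, here is a minimal sketch of how such a fractional device split could be expressed. The variable names and slicing below are assumptions for illustration, not the actual MaxText/Pathways implementation:

import jax

# Hypothetical fractions mirroring TRAINER_DEVICES_FRACTION / SAMPLER_DEVICES_FRACTION.
trainer_fraction = 0.5
sampler_fraction = 0.5

devices = jax.devices()
n_trainer = int(len(devices) * trainer_fraction)

if trainer_fraction == 1.0 and sampler_fraction == 1.0:
  trainer_devices = sampler_devices = devices   # same mesh for both roles
else:
  trainer_devices = devices[:n_trainer]         # first part of the devices -> trainer
  sampler_devices = devices[n_trainer:]         # remaining devices -> rollout/sampler

print(f"trainer: {len(trainer_devices)} devices, sampler: {len(sampler_devices)} devices")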

src/MaxText/estimator.py

Lines changed: 4 additions & 4 deletions
@@ -70,7 +70,7 @@ def tensor_score(tensor_name: str, config) -> tuple:
 
   The score is used to prioritize which tensors to offload/remat first. Tensors
   with a higher score are rematerialized later. The scoring is based on tensor
-  arithmatic intensity and memory size, with larger tensors getting lower scores
+  arithmetic intensity and memory size, with larger tensors getting lower scores
   (higher priority for remat).
 
   Args:
@@ -188,19 +188,19 @@ def largest_batch_size(base_argv, policy, min_pdb, max_pdb=64) -> int:
     print(f"No OOM at maximum batch size {max_pdb}.")
     return max_pdb
 
-  low, high, ans = min_pdb, max_pdb, min_pdb
+  low, high, result = min_pdb, max_pdb, min_pdb
   while low <= high:
     mid = (low + high) // 2
     if mid < min_pdb:
       low = mid + 1
       continue
 
     if not is_oom(base_argv, policy, mid):
-      ans = mid
+      result = mid
       low = mid + 1
     else:
      high = mid - 1
-  return ans
+  return result
 
 
 def is_oom(base_argv, policy: dict, pdb: int) -> bool:
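
The renamed `result` variable in this hunk is part of the usual "largest value for which a predicate holds" binary search. A self-contained sketch of that pattern with a stand-in predicate (the real code calls `is_oom`, which actually compiles and runs the model):

def largest_passing(lo: int, hi: int, fits) -> int:
  """Return the largest value in [lo, hi] for which fits(value) is True.

  Assumes fits is monotonic: once it fails, it fails for all larger values.
  """
  result = lo
  while lo <= hi:
    mid = (lo + hi) // 2
    if fits(mid):
      result = mid    # mid works; remember it and try larger values
      lo = mid + 1
    else:
      hi = mid - 1    # mid is too big; shrink the upper bound
  return result

# Stand-in for is_oom: pretend anything above batch size 40 runs out of memory.
print(largest_passing(1, 64, lambda pdb: pdb <= 40))  # -> 40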

src/MaxText/examples/grpo_llama3_1_70b_demo_pw.py

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@
 # for vLLM we can skip JAX precompilation with this flag, it makes startup faster
 os.environ["SKIP_JAX_PRECOMPILE"] = "1"
 
-# add the parent directory (two levels up to say ~/HOME/maxtext) to sys.path if currenlt runnig from
+# add the parent directory (two levels up to say ~/HOME/maxtext) to sys.path if currenlt running from
 # ~/HOME/maxtext/MaxText/examples
 
 # Get the directory of the current script
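
The comment touched in this hunk describes a path fix-up the script performs; the actual lines are not part of this hunk, so the following is only a sketch of the common idiom that comment describes, not the file's code (the same applies to the identical hunk in the 8b demo script below):

import os
import sys

# Directory of the current script, e.g. ~/HOME/maxtext/MaxText/examples
script_dir = os.path.dirname(os.path.abspath(__file__))

# Two levels up, e.g. ~/HOME/maxtext, so imports resolve from the repo root
maxtext_root = os.path.dirname(os.path.dirname(script_dir))
if maxtext_root not in sys.path:
  sys.path.insert(0, maxtext_root)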

src/MaxText/examples/grpo_llama3_1_8b_demo_pw.py

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@
 # for vLLM we can skip JAX precompilation with this flag, it makes startup faster
 os.environ["SKIP_JAX_PRECOMPILE"] = "1"
 
-# add the parent directory (two levels up to say ~/HOME/maxtext) to sys.path if currenlt runnig from
+# add the parent directory (two levels up to say ~/HOME/maxtext) to sys.path if currenlt running from
 # ~/HOME/maxtext/MaxText/examples
 
 # Get the directory of the current script
