sb-ai-lab · anathema-git · Aug 7, 2025 · Aug 12, 2025 · Aug 19, 2025 · Aug 26, 2025
diff --git a/examples/tutorials/AATestTutorial.ipynb b/examples/tutorials/AATestTutorial.ipynb
@@ -428,6 +428,27 @@
     "res.resume"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "55c32466",
+   "metadata": {},
+   "source": [
+    "**Interpretation of AA test results**\n",
+    "\n",
+    "Each row in the table corresponds to a target feature being tested for equality between the control and test groups. Two statistical tests are used:\n",
+    "\n",
+    "- **TTest**: tests whether the group means differ.\n",
+    "- **KSTest**: tests whether the group distributions differ.\n",
+    "\n",
+    "The `OK` / `NOT OK` labels show whether the difference is statistically significant. A `NOT OK` result indicates a possible imbalance.\n",
+    "\n",
+    "Typical threshold:\n",
+    "- If p-value < 0.05 → `NOT OK` (statistically significant difference)\n",
+    "- If p-value ≥ 0.05 → `OK` (no significant difference)\n",
+    "\n",
+    "If any metric has a `NOT OK` status in the `AA test` column, at least one iteration showed a statistically significant difference.\n"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 5,
@@ -506,6 +527,21 @@
     "res.aa_score"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "eb0ce07b",
+   "metadata": {},
+   "source": [
+    "**Interpreting `aa_score`**\n",
+    "\n",
+    "This output shows p-values and the overall pass/fail status for each test type and feature. A high p-value (close to 1.0) means the test passed: it found no significant difference between the groups.\n",
+    "\n",
+    "- `score`: p-value of the statistical test.\n",
+    "- `pass`: `True` if no iteration showed a significant difference.\n",
+    "\n",
+    "Note: Even if the average p-value is high, `pass` may still be `False` if at least one iteration had a p-value < 0.05.\n"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 6,
@@ -726,6 +762,18 @@
     "res.best_split"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "a225e982",
+   "metadata": {},
+   "source": [
+    "**About `best_split`**\n",
+    "\n",
+    "This shows the best split found: the one in which the control and test groups are most similar on the target metrics.\n",
+    "\n",
+    "You can use this split for future modeling or as a validation check before proceeding to actual experiments.\n"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 7,
@@ -824,6 +872,22 @@
     "res.best_split_statistic"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "ef1986ae",
+   "metadata": {},
+   "source": [
+    "**Understanding `best_split_statistic`**\n",
+    "\n",
+    "This table contains detailed statistics for the best (most balanced) split found across all iterations. You can compare:\n",
+    "\n",
+    "- Mean values in the control and test groups.\n",
+    "- Absolute and relative differences.\n",
+    "- p-values for both tests.\n",
+    "\n",
+    "Ideally, every row should show `OK` in both the TTest and KSTest columns and a small difference between groups (<1%)."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 8,
@@ -2085,12 +2149,16 @@
    "source": [
     "# AA Test with stratification\n",
     "\n",
-    "Depending on your requirements it is possible to stratify the data. You can set `stratification=True` and `StratificationRole` in `Dataset` to run it with stratification.  "
+    "Depending on your requirements, you can stratify the data. Set `stratification=True` and assign `StratificationRole` in `Dataset` to run the test with stratification.\n",
+    "\n",
+    "Stratified AA tests ensure that the control and test groups contain the same proportion of each category (e.g. the same share of each gender or region). This prevents imbalances in categorical features that can distort results.\n",
+    "\n",
+    "Make sure to assign `StratificationRole` to relevant columns in your dataset before enabling stratification."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": null,
    "id": "da9ab2f374ce1273",
    "metadata": {
     "ExecuteTime": {
@@ -5337,6 +5405,20 @@
    "source": [
     "res.best_split_statistic"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d3dd84bc",
+   "metadata": {},
+   "source": [
+    "## Common issues and tips\n",
+    "\n",
+    "- **Missing roles**: Make sure all target variables are assigned `TargetRole`. Columns without roles may cause silent failures.\n",
+    "- **Stratification**: If your dataset contains categorical features (e.g. `gender`, `region`) that may affect the outcome, assign them `StratificationRole` and set `stratification=True` in `AATest(...)`.\n",
+    "- **Imbalanced categories**: If some categories have too few samples, stratified splits may become unstable. Consider filtering or merging rare categories.\n",
+    "- **Random fluctuations**: On small datasets, it's normal to see occasional `NOT OK` results. Use more iterations (e.g. `n_iterations=50`) for stability.\n",
+    "- **Missing values**: NaNs in stratification columns may be treated as separate categories. Clean or fill missing values before stratified AA tests."
+   ]
   }
  ],
  "metadata": {