
Conversation

@jack-arturo
Member

This pull request introduces comprehensive benchmarking infrastructure and documentation to enable fair, apples-to-apples evaluation of AutoMem against the CORE memory system on the LoCoMo benchmark. It adds an official reproducibility guide and a competitor analysis, and implements a Node.js adapter for running CORE's evaluation methodology directly against AutoMem. The changes clarify the true performance gap, highlight the impact of evaluation methodology, and provide actionable recommendations for closing the gap with CORE.

Benchmarking and Evaluation Infrastructure:

  • Added an official reproducibility guide (REPRODUCIBILITY.md) with step-by-step instructions for running all benchmark modes, comparing results, and troubleshooting. This ensures anyone can replicate the reported numbers.
  • Introduced a detailed benchmark analysis report (BENCHMARK_ANALYSIS_2025-12-03.md) that compares AutoMem and CORE across evaluation modes, explains discrepancies, and provides category breakdowns and recommendations for reporting results.
  • Added a competitor analysis document (COMPETITOR_ANALYSIS.md) outlining the architectural differences between AutoMem, CORE, mem0, and Zep/Graphiti, and providing a prioritized roadmap to close the performance gap (e.g., statement extraction, temporal reasoning, contradiction detection).

CORE-Compatible Evaluation Adapter:

  • Implemented a Node.js adapter (core_adapter/automem-search.js) that mimics CORE's search interface, so CORE's evaluation scripts can run directly with AutoMem as the backend. This allows fair benchmarking using CORE's own code and methodology (a sketch of the adapter's episode/fact mapping follows below).
  • Added documentation (core_adapter/README.md) describing how to use the adapter, run the evaluation, and verify results against CORE's published numbers.
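
As a quick orientation, the adapter's episode/fact mapping (detailed further in the review comments below) roughly amounts to the following. This is a minimal Python sketch of the transformation only; the real adapter is Node.js, and the Bearer-token header, the results response key, and the exact /recall query parameters are assumptions for illustration.

```python
import requests

def automem_search(base_url: str, api_token: str, query: str, limit: int = 20) -> dict:
    """Recall memories from AutoMem and reshape them into CORE-style episodes/facts."""
    resp = requests.get(
        f"{base_url}/recall",
        params={"query": query, "limit": limit},           # query parameter names are assumed
        headers={"Authorization": f"Bearer {api_token}"},  # auth header shape is assumed
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])

    # Every recalled memory becomes an "episode" (plain text handed to the LLM as context).
    episodes = [r["memory"]["content"] for r in results if r.get("memory")]

    # Memories tagged entity:* are additionally surfaced as structured "facts".
    facts = [
        {
            "fact": r["memory"]["content"],
            "validAt": r["memory"].get("metadata", {}).get("session_datetime"),
            "source": r["memory"].get("id"),
        }
        for r in results
        if r.get("memory") and any(t.startswith("entity:") for t in r["memory"].get("tags", []))
    ]
    return {"episodes": episodes, "facts": facts}
```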

Dependencies:

  • Added nltk==3.9.1 to requirements-dev.txt for official F1 metric computation in benchmark scripts.

These changes provide a robust foundation for transparent, reproducible benchmarking and set the stage for targeted improvements to AutoMem's retrieval and reasoning capabilities.

@coderabbitai

coderabbitai bot commented Dec 8, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough


This PR adds a comprehensive benchmarking and experiment framework for AutoMem's LoCoMo evaluation. It introduces official LoCoMo metrics, a CORE-compatible evaluation adapter, an experiment configuration system with Railway-based deployment, an autonomous analysis agent for optimization, and various benchmark utilities. The framework supports multiple evaluation modes including end-to-end generation and lenient semantic scoring.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| **Git and Dependencies**<br>.gitignore, requirements-dev.txt | Added ignore pattern for benchmark results; added nltk==3.9.1 for official LoCoMo metrics. |
| **Documentation and Analysis**<br>tests/benchmarks/BENCHMARK_ANALYSIS_2025-12-03.md, tests/benchmarks/COMPETITOR_ANALYSIS.md, tests/benchmarks/REPRODUCIBILITY.md, tests/benchmarks/core_adapter/README.md, tests/benchmarks/experiments/README.md | Added comprehensive markdown documentation covering benchmark results, competitor analysis, reproducibility guide, CORE adapter setup, and experiment framework overview. |
| **LoCoMo Metrics and Evaluation**<br>tests/benchmarks/locomo_metrics.py, tests/benchmarks/test_locomo.py | New module implementing official LoCoMo F1 scoring, category-specific evaluation, and multi-answer handling; extended test_locomo.py with end-to-end generation mode, lenient semantic evaluation, official F1 tracking, and adversarial category (5) detection. |
| **CORE-Compatible Benchmark Adapter**<br>tests/benchmarks/core_adapter/automem-search.js, tests/benchmarks/core_adapter/evaluate.js, tests/benchmarks/core_adapter/package.json | New Node.js module providing AutoMem API bridge to CORE evaluation pipeline, evaluation script with context retrieval and answer generation, and package configuration for ES module setup. |
| **Experiment Configuration and Execution**<br>tests/benchmarks/experiments/experiment_config.py, tests/benchmarks/experiments/experiment_runner.py, tests/benchmarks/experiments/railway_manager.py | New Python modules defining experiment configuration system with enums and data structures, ExperimentRunner orchestrating grid search and ablation studies, and RailwayManager handling Railway/Docker deployments. |
| **Autonomous Optimization and Analysis**<br>tests/benchmarks/experiments/analysis_agent.py, tests/benchmarks/experiments/__init__.py | New LLM-based analysis agent for autonomous optimization with iterative decision-making, and experiments package initialization exposing configuration and generation utilities. |
| **Experiment and Benchmark Scripts**<br>tests/benchmarks/experiments/run_optimization.sh, tests/benchmarks/experiments/test_e2e_mini.py, tests/benchmarks/experiments/test_minimal.py, tests/benchmarks/run_all_benchmarks.sh, tests/benchmarks/run_lenient_benchmark.sh | Added orchestration shell scripts for optimization workflows and comprehensive benchmarking, and Python test scripts for minimal integration testing and end-to-end validation. |
| **Test Updates**<br>tests/test_api_endpoints.py, tests/test_embedding_providers.py | Updated test expectations for OpenAI unavailability (503 response) and fallback embedding behavior when API keys are absent. |

Sequence Diagrams

sequenceDiagram
    actor User
    participant Script as run_all_benchmarks.sh
    participant Python as test_locomo.py
    participant AutoMem
    participant OpenAI
    participant Metrics as locomo_metrics.py

    User->>Script: Execute benchmark
    Script->>Script: Verify AutoMem healthy
    Script->>Python: Run default mode
    loop Per Conversation/Question
        Python->>AutoMem: POST /memory (store)
        Python->>AutoMem: GET /recall (retrieve)
        Python->>OpenAI: Generate answer (e2e mode)
        Python->>Metrics: Score answer (official F1)
    end
    Python->>Script: Return results (JSON)
    Script->>Script: Display summary
sequenceDiagram
    actor User
    participant Runner as ExperimentRunner
    participant Config as ExperimentConfig
    participant Railway as RailwayManager
    participant Evaluator as LoCoMoEvaluator
    participant Agent as AnalysisAgent

    User->>Runner: run_grid_search(configs)
    loop Per Config
        Runner->>Railway: deploy_instance(config)
        Railway->>Railway: Wait for healthy
        Runner->>Evaluator: run_single_experiment()
        Evaluator->>Evaluator: Evaluate & score
        Runner->>Runner: Save result
    end
    Runner->>Agent: analyze_and_decide(results)
    Agent->>Agent: Format results & LLM prompt
    Agent->>Agent: Parse decision (continue/stop)
    alt Continue
        Agent->>Config: Generate next configs
        Agent->>Runner: run_grid_search(next_configs)
    else Stop/Converged
        Agent->>Agent: Save final report
    end
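To make the experiment-runner flow above concrete, a hypothetical driver could look like the following. The import paths, the ExperimentRunner constructor arguments, and the assumption that run_grid_search returns the collected ExperimentResult objects are illustrative rather than confirmed by this PR; score() and config.name are referenced later in this review.

```python
import asyncio

# Import paths are assumptions; the modules live under tests/benchmarks/experiments/.
from experiment_config import generate_experiment_grid
from experiment_runner import ExperimentRunner

async def main() -> None:
    configs = generate_experiment_grid()   # enumerate ExperimentConfig candidates
    runner = ExperimentRunner()            # orchestrates deploy -> evaluate -> score
    results = await runner.run_grid_search(configs)

    # Rank configurations by their weighted score (higher is better).
    for result in sorted(results, key=lambda r: r.score(), reverse=True)[:3]:
        print(result.config.name, round(result.score(), 3))

if __name__ == "__main__":
    asyncio.run(main())
```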
sequenceDiagram
    participant Client as CORE Evaluator
    participant Adapter as automem-search.js
    participant AutoMem
    participant OpenAI as OpenAI API
    participant Metrics as evaluate.js

    Client->>Adapter: search(query, userId)
    Adapter->>AutoMem: GET /recall
    AutoMem->>Adapter: Return memories
    Adapter->>Adapter: Format to episodes/facts
    Adapter->>Client: Return results
    Client->>OpenAI: Generate answer (with context)
    OpenAI->>Client: Return answer
    Client->>Metrics: Evaluate answer
    Metrics->>Metrics: Compute correctness label
    Metrics->>Client: Return CORRECT/WRONG

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • railway_manager.py: Subprocess-based Railway CLI and Docker integration, health-check polling logic, error handling for external tool failures, and concurrent instance management with semaphore coordination (a minimal sketch of this polling pattern follows this list).
  • experiment_runner.py: Async orchestration with parallel execution paths (Railway vs. local), mutation generation logic, configuration-to-environment translation, and result persistence patterns.
  • analysis_agent.py: LLM decision parsing from JSON output, fallback handling on parse failures, iterative loop termination logic, and history tracking for multi-iteration optimization.
  • test_locomo.py: New evaluation modes (e2e, lenient, official F1) integrated with existing retrieval flow, category-specific prompting, and adversarial detection logic require careful verification against LoCoMo spec.
  • core_adapter/evaluate.js: LLM integration with structured prompts per category, result caching, evaluation fallback heuristics, and error recovery across retrieval and generation steps.
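
The railway_manager.py item above mentions health-check polling and semaphore-coordinated concurrency; the following is a minimal sketch of that pattern under stated assumptions (aiohttp as the HTTP client, a GET /health endpoint, and hypothetical deploy_instance/run_single_experiment helpers), not the module's actual implementation.

```python
import asyncio
import aiohttp

MAX_CONCURRENT_INSTANCES = 3
_slots = asyncio.Semaphore(MAX_CONCURRENT_INSTANCES)

async def wait_for_healthy(base_url: str, timeout_s: int = 300) -> bool:
    """Poll GET /health until it returns 200 or the deadline passes."""
    deadline = asyncio.get_running_loop().time() + timeout_s
    async with aiohttp.ClientSession() as session:
        while asyncio.get_running_loop().time() < deadline:
            try:
                async with session.get(f"{base_url}/health",
                                       timeout=aiohttp.ClientTimeout(total=5)) as resp:
                    if resp.status == 200:
                        return True
            except (aiohttp.ClientError, asyncio.TimeoutError):
                pass  # instance still starting; retry after a short pause
            await asyncio.sleep(5)
    return False

async def deploy_and_evaluate(config) -> None:
    async with _slots:  # cap how many Railway/Docker instances run at once
        base_url = await deploy_instance(config)           # hypothetical helper
        if await wait_for_healthy(base_url):
            await run_single_experiment(config, base_url)  # hypothetical helper
```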

Possibly related PRs

Suggested labels

enhancement

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title 'Cleanup/benchmark' is vague and generic, using non-descriptive terms that don't clearly convey the substantial changes in this comprehensive PR about benchmarking infrastructure. | Revise title to be more specific and descriptive, such as 'Add comprehensive benchmarking infrastructure and CORE-compatible evaluation adapter' to better reflect the scope and intent of the changeset. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The description clearly relates to the changeset, detailing the benchmarking infrastructure, documentation, CORE-compatible adapter, and dependencies that are present in the PR changes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 82.05%, which is sufficient. The required threshold is 80.00%. |

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot added the enhancement (New feature or request) label on Dec 8, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

🧹 Nitpick comments (36)
tests/test_embedding_providers.py (1)

545-554: Tighten env isolation to always exercise the auto‑selection fallback path

The new block correctly ensures OpenAI is skipped and FastEmbed fails so that placeholder is selected. However, the test still depends on the outer environment’s EMBEDDING_PROVIDER value: if it is set to "openai" or "local", init_embedding_provider will raise instead of falling back to the placeholder.

To make the test fully deterministic and aligned with the documented provider selection behavior, explicitly force the auto path here, e.g.:

with patch.dict(
    os.environ,
    {"EMBEDDING_PROVIDER": "auto", "OPENAI_API_KEY": ""},
    clear=False,
):
    with patch("automem.embedding.fastembed.TextEmbedding", side_effect=failing_init):
        ...

Optionally, since you no longer use the monkeypatch fixture in this test, you can drop it from the function signature or switch to monkeypatch.setenv for consistency with other tests. Based on learnings, EMBEDDING_PROVIDER is the primary switch for provider behavior.

tests/benchmarks/COMPETITOR_ANALYSIS.md (1)

30-32: Consider adding language specifiers to code blocks.

The ASCII diagrams would benefit from a language specifier like text or plaintext for consistency and to satisfy linters.

-```
+```text
 Episode → Normalization → Extraction → Resolution → Graph Integration

tests/benchmarks/experiments/test_e2e_mini.py (3)

44-44: Remove extraneous f-prefix from strings without placeholders.

Several print statements use f-strings without any placeholders.



-    print(f"\n[1/4] Storing memories...")
+    print("\n[1/4] Storing memories...")

Similar changes needed at lines 91, 158.


100-100: Remove unused variable category.

The category variable is assigned but never used.

-        category = qa.get('category', 1)
+        # category = qa.get('category', 1)  # Available for future use

Or simply remove the line if not needed.


77-78: Consider catching more specific exceptions.

Broad Exception catches can mask unexpected errors. Consider catching requests.RequestException for HTTP-related failures.

-            except Exception as e:
+            except requests.RequestException as e:
                 print(f"  Error storing: {e}")
tests/benchmarks/experiments/experiment_config.py (3)

10-10: Remove unused import.

itertools is imported but never used in this file.

-import itertools

14-36: Consider using the defined Enums in ExperimentConfig.

The EmbeddingModel, EnrichmentModel, and RecallStrategy enums are defined but not used - ExperimentConfig uses plain str types instead. Either use the enums for type safety or remove them to avoid confusion.

Option 1 - Use enums (provides validation):

embedding_model: EmbeddingModel = EmbeddingModel.OPENAI_SMALL

Option 2 - Keep strings but remove unused enums if they're not needed elsewhere.
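
If Option 1 is chosen, a minimal sketch of the shape it could take; the member values and the to_env_vars() body here are assumptions, not the file's actual definitions.

```python
from dataclasses import dataclass
from enum import Enum

class EmbeddingModel(str, Enum):
    OPENAI_SMALL = "text-embedding-3-small"   # values are illustrative
    OPENAI_LARGE = "text-embedding-3-large"

@dataclass
class ExperimentConfig:
    name: str
    embedding_model: EmbeddingModel = EmbeddingModel.OPENAI_SMALL

    def to_env_vars(self) -> dict:
        # str-valued enums serialize cleanly when injected as environment variables
        return {"EMBEDDING_MODEL": self.embedding_model.value}

cfg = ExperimentConfig(name="baseline", embedding_model=EmbeddingModel.OPENAI_LARGE)
assert cfg.to_env_vars() == {"EMBEDDING_MODEL": "text-embedding-3-large"}
```

A str-valued Enum keeps existing string comparisons working while giving an unknown model name (e.g. EmbeddingModel("bogus")) a clear ValueError failure mode.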


126-152: Guard against missing keys in custom weights.

If a caller provides partial weights dict, score() will raise KeyError. Consider using .get() with defaults or validating the weights dict.

     def score(self, weights: Optional[Dict[str, float]] = None) -> float:
-        """
-        Calculate weighted score for this configuration.
-        Higher is better.
-        """
-        if weights is None:
-            weights = {
+        """Calculate weighted score for this configuration. Higher is better."""
+        default_weights = {
                 "accuracy": 0.4,
                 "response_time": 0.2,
-                "multi_hop": 0.2,  # Emphasize our weakness
+                "multi_hop": 0.2,
                 "cost": 0.1,
                 "temporal": 0.1,
             }
+        if weights is None:
+            weights = default_weights
+        else:
+            # Merge with defaults to handle partial weights
+            weights = {**default_weights, **weights}
tests/benchmarks/core_adapter/README.md (1)

8-8: Format the URL as a markdown link.

The bare URL should be formatted as a proper markdown link for better rendering and accessibility.

Apply this diff:

-CORE (https://github.com/RedPlanetHQ/core-benchmark) claims:
+[CORE](https://github.com/RedPlanetHQ/core-benchmark) claims:
tests/benchmarks/experiments/test_minimal.py (2)

1-1: Consider making the file executable or removing the shebang.

The shebang is present but the file lacks execute permissions. Either add execute permissions with chmod +x or remove the shebang if the script is not intended to be executed directly.


121-127: Unused unpacked variable.

The explanation variable is unpacked but never used. Consider using _ to indicate it's intentionally ignored.

Apply this diff:

-            is_correct, confidence, explanation = evaluator._evaluate_lenient_semantic(
+            is_correct, confidence, _ = evaluator._evaluate_lenient_semantic(
                 question, expected, generated, category
             )
         else:
-            is_correct, confidence, explanation = evaluator.check_answer_in_memories(
+            is_correct, confidence, _ = evaluator.check_answer_in_memories(
                 question, expected, recalled, category
             )
tests/benchmarks/core_adapter/evaluate.js (1)

244-244: Clarify the context slice limit.

The search limit is set to 20 episodes (line 32), but only the first 10 are used for context (line 244). Consider documenting why this discrepancy exists or aligning the limits.

If the 10-episode limit is intentional for context window management, add a comment:

-      const context = searchResults.episodes.slice(0, 10).join("\n\n");
+      // Limit to 10 episodes for context window management (even though we retrieve 20)
+      const context = searchResults.episodes.slice(0, 10).join("\n\n");
tests/benchmarks/core_adapter/automem-search.js (1)

43-97: Consider logging more details on search failures for debugging.

The error handler at line 94 only logs error.message, which may omit useful debugging info like HTTP status codes or response bodies. For a benchmark adapter, this could make diagnosing failures difficult.

     } catch (error) {
-      console.error("AutoMem search error:", error.message);
+      console.error("AutoMem search error:", error.message, error.response?.status, error.response?.data);
       return { episodes: [], facts: [] };
     }
tests/benchmarks/test_locomo.py (5)

738-767: Unused variable found_in_memory should be removed.

The variable found_in_memory is assigned at line 752 but never used. This is flagged by static analysis.

                 if overlap_ratio > max_overlap:
                     max_overlap = overlap_ratio
-                    found_in_memory = mem.get("id")

1077-1085: Use explicit Optional type hints per PEP 484.

The type hints on lines 1083-1084 use implicit None defaults without Optional. Per coding guidelines and PEP 484, use explicit Optional[T] or T | None.

     def check_answer_in_memories(
         self, 
         question: str,
         expected_answer: Any,
         recalled_memories: List[Dict[str, Any]],
-        evidence_dialog_ids: List[str] = None,
-        sample_id: str = None,
+        evidence_dialog_ids: Optional[List[str]] = None,
+        sample_id: Optional[str] = None,
         category: int = 0,
     ) -> Tuple[bool, float, str]:

Note: You'll need to add Optional to the imports from typing.


1438-1438: Use explicit Optional for conversation_ids parameter.

Same PEP 484 issue - use Optional[List[str]] instead of implicit None.

-    def run_benchmark(self, cleanup_after: bool = True, conversation_ids: List[str] = None) -> Dict[str, Any]:
+    def run_benchmark(self, cleanup_after: bool = True, conversation_ids: Optional[List[str]] = None) -> Dict[str, Any]:

699-728: Consider using question parameter for improved adversarial detection.

The question parameter is unused (static analysis ARG002). While the current implementation works, the question context could improve adversarial detection accuracy (e.g., checking if question keywords appear in memories).

If intentionally unused for interface consistency, consider prefixing with underscore:

     def _evaluate_adversarial(
         self,
-        question: str,
+        _question: str,  # Reserved for future use
         adversarial_answer: str,
         recalled_memories: List[Dict[str, Any]],
     ) -> Tuple[bool, float, str]:

851-854: Consider moving return statement to else block for cleaner control flow.

Per static analysis hint TRY300, the return at line 849 could be in an else block for clearer success/error separation.

         try:
             response = self.openai_client.chat.completions.create(
                 model=self.config.e2e_model,
                 # ...
             )

             answer = response.choices[0].message.content.strip()
             self.e2e_cache[cache_key] = answer
-            return answer
-
-        except Exception as e:
+        except Exception as e:
             print(f"⚠️  E2E generation error: {e}")
             return "error generating answer"
+        else:
+            return answer
tests/benchmarks/locomo_metrics.py (3)

24-28: NLTK download may fail silently in CI/production environments.

The LookupError catch downloads punkt, but PorterStemmer doesn't require punkt - it's part of nltk.stem. The download might be unnecessary, and network-dependent initialization could fail in restricted environments.

 # Initialize stemmer
-try:
-    ps = PorterStemmer()
-except LookupError:
-    nltk.download("punkt")
-    ps = PorterStemmer()
+ps = PorterStemmer()  # PorterStemmer doesn't require data downloads

If you do need NLTK data for tokenization elsewhere, consider downloading during setup/install rather than at import time.


173-208: Remove unused question parameter or document its purpose.

The question parameter is never used in the function body. Either remove it or add a leading underscore to indicate it's intentionally unused (e.g., for API consistency).

 def extract_answer_from_context(
-    memories: List[str], question: str, answer_hint: str = None
+    memories: List[str], _question: str, answer_hint: Optional[str] = None
 ) -> str:

Also, use explicit Optional[str] for answer_hint per PEP 484.


358-364: Unused loop variable cat - use underscore prefix or use it.

The loop iterates over summary["categories"].items() but only uses data, not cat.

-        for cat, data in summary["categories"].items():
+        for _cat, data in summary["categories"].items():
             print(
                 f"  {data['name']:35s}: "
tests/benchmarks/experiments/analysis_agent.py (4)

259-265: Remove extraneous f prefix from strings without placeholders.

Lines 259, 262, and 295 have f-strings without any placeholders.

-            print(f"\n🤔 Analyzing results...")
+            print("\n🤔 Analyzing results...")
             decision = await self.analyze_and_decide(all_results)
             
-            print(f"\n📋 Agent Decision:")
+            print("\n📋 Agent Decision:")
             print(f"   Continue: {decision.get('continue_exploring', False)}")

And at line 295:

-            print(f"\nFinal Config:")
+            print("\nFinal Config:")

354-362: Unused best_config variable.

The result of run_autonomous_optimization is assigned to best_config but never used afterward.

-    best_config = await agent.run_autonomous_optimization(runner=runner)
+    await agent.run_autonomous_optimization(runner=runner)
     
     print("\n✅ Optimization complete!")
-    print(f"Best configuration saved to: {args.output_dir}")
+    print(f"Best configuration saved to: {args.output_dir}/")

143-150: Broad exception catch loses error context.

Catching bare Exception and returning a fallback dict hides the error type. Consider logging the full traceback for debugging.

+        import traceback
         except Exception as e:
-            print(f"❌ Analysis failed: {e}")
+            print(f"❌ Analysis failed: {e}\n{traceback.format_exc()}")
             return {
                 "continue_exploring": False,

1-1: Shebang without executable permission.

The file has a shebang (#!/usr/bin/env python3) but isn't marked executable. Either remove the shebang or make the file executable.

Run chmod +x tests/benchmarks/experiments/analysis_agent.py or remove line 1.

tests/benchmarks/experiments/railway_manager.py (3)

91-92: Hardcoded /tmp path is insecure and non-portable.

Using /tmp directly can lead to symlink attacks or race conditions. Use tempfile module for secure temporary file handling.

+import tempfile

 # In deploy_instance:
-                    cwd="/tmp"
+                    cwd=tempfile.gettempdir()

166-169: Use tempfile.NamedTemporaryFile for secure env file creation.

Writing to /tmp/{instance_name}.env with predictable names is vulnerable to race conditions.

+import tempfile

-        env_file = f"/tmp/{instance_name}.env"
-        with open(env_file, "w") as f:
+        with tempfile.NamedTemporaryFile(mode="w", suffix=".env", delete=False) as f:
+            env_file = f.name
             for key, value in env_vars.items():
                 f.write(f"{key}={value}\n")

266-268: Silent exception swallowing hides health check failures.

The try-except-pass pattern makes debugging impossible when health checks fail unexpectedly. At minimum, log the error.

                 except Exception:
-                    pass
+                    pass  # Connection refused or timeout is expected during startup
                 await asyncio.sleep(5)

Or consider logging at debug level:

+import logging
+logger = logging.getLogger(__name__)

                 except Exception as e:
+                    logger.debug(f"Health check pending: {e}")
                     pass
tests/benchmarks/experiments/experiment_runner.py (8)

110-110: Remove redundant import.

The json module is already imported at line 15. This local import on line 110 is unnecessary.

-            import json
             with open(locomo_config.data_file, 'r') as f:

292-294: Use proper dataclass copying instead of __dict__.

Using __dict__.copy() to copy the config is fragile and may not handle nested objects or future dataclass fields correctly. If ExperimentConfig is a dataclass (which seems likely from the usage), use dataclasses.replace() or proper copying.

+from dataclasses import replace
+
             # Create mutated config
-            config_dict = config.__dict__.copy()
-            config_dict["name"] = f"{config.name}_mut{i}_{param}"
-            config_dict[param] = value
+            kwargs = {
+                "name": f"{config.name}_mut{i}_{param}",
+                param: value
+            }
             
             # Keep weights normalized
             if param in ["vector_weight", "graph_weight"]:
                 other = "graph_weight" if param == "vector_weight" else "vector_weight"
-                config_dict[other] = 1.0 - value
+                kwargs[other] = 1.0 - value
             
-            mutations.append(ExperimentConfig(**config_dict))
+            mutations.append(replace(config, **kwargs))

297-299: Weight normalization logic may not preserve constraint.

If the randomly selected parameter to mutate happens to be a weight parameter, the logic normalizes the complementary weight. However, this assumes vector_weight + graph_weight = 1.0. If the original config doesn't satisfy this constraint or if other code allows different totals, this could introduce inconsistency.

Consider adding an assertion or validation to ensure the weight constraint is maintained:

             # Keep weights normalized
             if param in ["vector_weight", "graph_weight"]:
                 other = "graph_weight" if param == "vector_weight" else "vector_weight"
                 config_dict[other] = 1.0 - value
+                # Validate constraint
+                assert abs(config_dict["vector_weight"] + config_dict["graph_weight"] - 1.0) < 0.001

339-339: Remove unused variable worst.

The variable worst is assigned but never used in the report generation.

         sorted_results = sorted(self.results, key=lambda r: r.score(), reverse=True)
         best = sorted_results[0]
-        worst = sorted_results[-1]
         baseline = next((r for r in self.results if r.config.name == "baseline"), sorted_results[-1])

382-382: Remove unnecessary f-string prefixes.

Several f-strings don't contain any placeholders and should use regular strings instead.

-        report.append(f"\n2. Multi-hop Reasoning (our weakness):")
+        report.append("\n2. Multi-hop Reasoning (our weakness):")
 
-        report.append(f"\n3. Temporal Understanding:")
+        report.append("\n3. Temporal Understanding:")
 
-            report.append(f"\n4. Response Times:")
+            report.append("\n4. Response Times:")
 
-            report.append(f"\nEmbedding Model Impact:")
+            report.append("\nEmbedding Model Impact:")
 
-            report.append(f"\nFact Extraction Impact:")
+            report.append("\nFact Extraction Impact:")

Also applies to: 392-392, 399-399, 414-414, 425-425


631-631: Remove unnecessary f-string prefix.

The f-string on line 631 doesn't contain any placeholders.

-            print(f"\n🎯 Optimal configuration found:")
+            print("\n🎯 Optimal configuration found:")

649-649: Remove or use the results variable in ablation mode.

In ablation mode, the results variable is assigned but never used. Either use it to generate specific ablation insights or remove the assignment.

-            results = await runner.run_ablation_study(args.param, values)
+            await runner.run_ablation_study(args.param, values)

Or if you want to use the results:

             results = await runner.run_ablation_study(args.param, values)
+            # Ablation-specific analysis could go here
+            print(f"\n📊 Ablation study completed with {len(results)} variants")

655-662: Railway cleanup logic is correct but could be clarified.

The Railway cleanup appears in both the except KeyboardInterrupt block and the finally block. While this ensures cleanup happens regardless of how the program exits, calling destroy_all() twice could be wasteful. Consider using a flag to track whether cleanup has already occurred, or rely solely on the finally block.

     except KeyboardInterrupt:
         print("\n⚠️ Interrupted! Cleaning up...")
-        if args.railway:
-            await runner.instance_manager.destroy_all()
     
     finally:
         if args.railway:
             await runner.instance_manager.destroy_all()

The finally block will execute even after the except block, so removing the duplicate call from the except handler should be sufficient. The print statement can remain to inform the user.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between abc689d and 34a2f76.

⛔ Files ignored due to path filters (1)
  • tests/benchmarks/core_adapter/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (24)
  • .gitignore (1 hunks)
  • requirements-dev.txt (1 hunks)
  • tests/benchmarks/BENCHMARK_ANALYSIS_2025-12-03.md (1 hunks)
  • tests/benchmarks/COMPETITOR_ANALYSIS.md (1 hunks)
  • tests/benchmarks/REPRODUCIBILITY.md (1 hunks)
  • tests/benchmarks/core_adapter/README.md (1 hunks)
  • tests/benchmarks/core_adapter/automem-search.js (1 hunks)
  • tests/benchmarks/core_adapter/evaluate.js (1 hunks)
  • tests/benchmarks/core_adapter/package.json (1 hunks)
  • tests/benchmarks/experiments/README.md (1 hunks)
  • tests/benchmarks/experiments/__init__.py (1 hunks)
  • tests/benchmarks/experiments/analysis_agent.py (1 hunks)
  • tests/benchmarks/experiments/experiment_config.py (1 hunks)
  • tests/benchmarks/experiments/experiment_runner.py (1 hunks)
  • tests/benchmarks/experiments/railway_manager.py (1 hunks)
  • tests/benchmarks/experiments/run_optimization.sh (1 hunks)
  • tests/benchmarks/experiments/test_e2e_mini.py (1 hunks)
  • tests/benchmarks/experiments/test_minimal.py (1 hunks)
  • tests/benchmarks/locomo_metrics.py (1 hunks)
  • tests/benchmarks/run_all_benchmarks.sh (1 hunks)
  • tests/benchmarks/run_lenient_benchmark.sh (1 hunks)
  • tests/benchmarks/test_locomo.py (16 hunks)
  • tests/test_api_endpoints.py (5 hunks)
  • tests/test_embedding_providers.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use pytest with DummyGraph fixture to mock FalkorDB operations for unit testing

Files:

  • tests/benchmarks/experiments/test_e2e_mini.py
  • tests/test_api_endpoints.py
  • tests/test_embedding_providers.py
  • tests/benchmarks/experiments/__init__.py
  • tests/benchmarks/experiments/test_minimal.py
  • tests/benchmarks/experiments/analysis_agent.py
  • tests/benchmarks/locomo_metrics.py
  • tests/benchmarks/experiments/experiment_config.py
  • tests/benchmarks/experiments/railway_manager.py
  • tests/benchmarks/test_locomo.py
  • tests/benchmarks/experiments/experiment_runner.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use black for Python code formatting
Use flake8 for Python linting

**/*.py: Use Python with type hints. Indent 4 spaces; line length 100 (Black).
Use snake_case for module and function names in Python.
Use PascalCase for class names in Python.
Use UPPER_SNAKE_CASE for constants in Python.
Format code with Black (line length 100) and organize imports with Isort (profile=black).
Lint with Flake8 and ensure CI passes before committing.
Add or adjust tests for new endpoints, stores, or utils; maintain comprehensive test coverage.

Files:

  • tests/benchmarks/experiments/test_e2e_mini.py
  • tests/test_api_endpoints.py
  • tests/test_embedding_providers.py
  • tests/benchmarks/experiments/__init__.py
  • tests/benchmarks/experiments/test_minimal.py
  • tests/benchmarks/experiments/analysis_agent.py
  • tests/benchmarks/locomo_metrics.py
  • tests/benchmarks/experiments/experiment_config.py
  • tests/benchmarks/experiments/railway_manager.py
  • tests/benchmarks/test_locomo.py
  • tests/benchmarks/experiments/experiment_runner.py
tests/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/test_*.py: Place tests in tests/ directory named test_*.py using Pytest framework.
Prefer fixtures over globals in Pytest tests.

Files:

  • tests/test_api_endpoints.py
  • tests/test_embedding_providers.py
🧠 Learnings (8)
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: Applies to **/*.py : Add or adjust tests for new endpoints, stores, or utils; maintain comprehensive test coverage.

Applied to files:

  • tests/benchmarks/experiments/test_e2e_mini.py
  • tests/benchmarks/experiments/test_minimal.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to app.py : Implement 13 API endpoints covering core memory operations (POST /memory, GET /recall, PATCH /memory/<id>, DELETE /memory/<id>, GET /memory/by-tag), relationship management (POST /associate), consolidation (POST /consolidate, GET /consolidate/status), health checks (GET /health), enrichment (GET /enrichment/status, POST /enrichment/reprocess), analysis (GET /analyze), and startup recall (GET /startup-recall)

Applied to files:

  • tests/benchmarks/experiments/test_e2e_mini.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Implement embedding generation using a provider pattern with three backends in priority order: OpenAI (if OPENAI_API_KEY set), FastEmbed (local ONNX model), and Placeholder (hash-based fallback)

Applied to files:

  • tests/test_api_endpoints.py
  • tests/test_embedding_providers.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Implement EmbeddingProvider abstraction with three implementations: OpenAIEmbeddingProvider, FastEmbedProvider, and PlaceholderEmbeddingProvider

Applied to files:

  • tests/test_embedding_providers.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Control embedding provider via EMBEDDING_PROVIDER environment variable with values: auto (default), openai, local, or placeholder

Applied to files:

  • tests/test_embedding_providers.py
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: Use the Agent Memory Protocol: fetch context with recall at task start, persist key steps with POST `/memory`, tag with conventions (`agent`, `repo:<name>`, `task:<id>`, `file:<path>`, `pr:<num>`, `issue:<num>`, `result`, `decision`, `docs`), and associate related memories.

Applied to files:

  • tests/benchmarks/experiments/README.md
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: Core package is located at `automem/` with notable directories: `api/` (Flask blueprints), `utils/`, `stores/`, and `config.py`.

Applied to files:

  • tests/benchmarks/experiments/__init__.py
  • tests/benchmarks/core_adapter/package.json
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to **/*enrichment*.py : Detect recurring patterns per memory type, strengthen shared Pattern nodes, and link memories via EXEMPLIFIES relationships with key terms

Applied to files:

  • tests/benchmarks/experiments/experiment_config.py
🧬 Code graph analysis (10)
tests/benchmarks/core_adapter/automem-search.js (1)
mcp-sse-server/server.js (1)
  • userId (333-333)
tests/benchmarks/experiments/test_e2e_mini.py (3)
tests/benchmarks/experiments/experiment_config.py (1)
  • ExperimentConfig (40-92)
tests/benchmarks/locomo_metrics.py (1)
  • f1_score (59-89)
tests/conftest.py (1)
  • OpenAI (128-131)
tests/benchmarks/experiments/run_optimization.sh (2)
tests/benchmarks/experiments/analysis_agent.py (1)
  • main (336-362)
tests/benchmarks/experiments/experiment_runner.py (1)
  • main (514-662)
tests/benchmarks/core_adapter/evaluate.js (2)
tests/benchmarks/core_adapter/automem-search.js (1)
  • AutoMemSearchService (15-110)
tests/conftest.py (1)
  • OpenAI (128-131)
tests/test_embedding_providers.py (1)
app.py (1)
  • init_embedding_provider (1088-1180)
tests/benchmarks/experiments/__init__.py (1)
tests/benchmarks/experiments/experiment_config.py (4)
  • ExperimentConfig (40-92)
  • ExperimentResult (96-152)
  • generate_experiment_grid (155-242)
  • generate_focused_experiments (245-260)
tests/benchmarks/experiments/test_minimal.py (2)
tests/benchmarks/experiments/experiment_config.py (1)
  • ExperimentConfig (40-92)
tests/benchmarks/test_locomo.py (1)
  • LoCoMoConfig (54-104)
tests/benchmarks/experiments/analysis_agent.py (2)
tests/benchmarks/experiments/experiment_config.py (5)
  • ExperimentConfig (40-92)
  • ExperimentResult (96-152)
  • score (126-152)
  • generate_experiment_grid (155-242)
  • to_json (91-92)
tests/benchmarks/experiments/experiment_runner.py (3)
  • ExperimentRunner (47-511)
  • run_grid_search (152-180)
  • main (514-662)
tests/benchmarks/locomo_metrics.py (1)
tests/benchmarks/test_locomo.py (1)
  • normalize_answer (538-571)
tests/benchmarks/experiments/railway_manager.py (1)
tests/benchmarks/experiments/experiment_config.py (2)
  • ExperimentConfig (40-92)
  • to_env_vars (70-89)
🪛 LanguageTool
tests/benchmarks/experiments/README.md

[grammar] ~160-~160: Ensure spelling is correct
Context: ...Monitor response times - stay under 100ms for production 4. Watch token costs...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

tests/benchmarks/COMPETITOR_ANALYSIS.md

[style] ~189-~189: To elevate your writing, try using an alternative expression here.
Context: ...ompetitive advantage. --- ## Why This Matters for LoCoMo Categories | Category | Gap...

(MATTERS_RELEVANT)

🪛 markdownlint-cli2 (0.18.1)
tests/benchmarks/core_adapter/README.md

8-8: Bare URL used

(MD034, no-bare-urls)

tests/benchmarks/experiments/README.md

8-8: Bare URL used

(MD034, no-bare-urls)

tests/benchmarks/COMPETITOR_ANALYSIS.md

32-32: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


105-105: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


162-162: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🪛 Ruff (0.14.7)
tests/benchmarks/experiments/test_e2e_mini.py

1-1: Shebang is present but file is not executable

(EXE001)


44-44: f-string without any placeholders

Remove extraneous f prefix

(F541)


77-77: Do not catch blind exception: Exception

(BLE001)


91-91: f-string without any placeholders

Remove extraneous f prefix

(F541)


100-100: Local variable category is assigned to but never used

Remove assignment to unused variable category

(F841)


106-106: Probable use of requests call without timeout

(S113)


137-137: Do not catch blind exception: Exception

(BLE001)


158-158: f-string without any placeholders

Remove extraneous f prefix

(F541)


159-159: Probable use of requests call without timeout

(S113)


161-161: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/benchmarks/experiments/test_minimal.py

1-1: Shebang is present but file is not executable

(EXE001)


34-34: f-string without any placeholders

Remove extraneous f prefix

(F541)


50-50: f-string without any placeholders

Remove extraneous f prefix

(F541)


58-58: f-string without any placeholders

Remove extraneous f prefix

(F541)


63-63: Prefer next(iter(evaluator.conversations.keys())) over single element slice

Replace with next(iter(evaluator.conversations.keys()))

(RUF015)


68-68: f-string without any placeholders

Remove extraneous f prefix

(F541)


73-73: f-string without any placeholders

Remove extraneous f prefix

(F541)


78-78: f-string without any placeholders

Remove extraneous f prefix

(F541)


87-87: f-string without any placeholders

Remove extraneous f prefix

(F541)


91-91: f-string without any placeholders

Remove extraneous f prefix

(F541)


125-125: Unpacked variable explanation is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


138-138: f-string without any placeholders

Remove extraneous f prefix

(F541)

tests/benchmarks/experiments/analysis_agent.py

1-1: Shebang is present but file is not executable

(EXE001)


141-141: Consider moving this statement to an else block

(TRY300)


143-143: Do not catch blind exception: Exception

(BLE001)


259-259: f-string without any placeholders

Remove extraneous f prefix

(F541)


262-262: f-string without any placeholders

Remove extraneous f prefix

(F541)


295-295: f-string without any placeholders

Remove extraneous f prefix

(F541)


359-359: Local variable best_config is assigned to but never used

Remove assignment to unused variable best_config

(F841)

tests/benchmarks/locomo_metrics.py

174-174: Unused function argument: question

(ARG001)


174-174: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


358-358: Loop control variable cat not used within loop body

Rename unused cat to _cat

(B007)

tests/benchmarks/experiments/railway_manager.py

55-55: Starting a process with a partial executable path

(S607)


60-60: Consider moving this statement to an else block

(TRY300)


61-61: Do not catch blind exception: Exception

(BLE001)


82-82: subprocess call: check for execution of untrusted input

(S603)


83-87: Starting a process with a partial executable path

(S607)


91-91: Probable insecure usage of temporary file or directory: "/tmp"

(S108)


104-104: subprocess call: check for execution of untrusted input

(S603)


105-105: Starting a process with a partial executable path

(S607)


112-112: Starting a process with a partial executable path

(S607)


141-141: Consider moving this statement to an else block

(TRY300)


143-143: Do not catch blind exception: Exception

(BLE001)


166-166: Probable insecure usage of temporary file or directory: "/tmp/"

(S108)


173-173: subprocess call: check for execution of untrusted input

(S603)


174-180: Starting a process with a partial executable path

(S607)


205-205: Consider moving this statement to an else block

(TRY300)


207-207: Do not catch blind exception: Exception

(BLE001)


216-216: subprocess call: check for execution of untrusted input

(S603)


217-217: Starting a process with a partial executable path

(S607)


224-224: Starting a process with a partial executable path

(S607)


239-239: Consider moving this statement to an else block

(TRY300)


241-241: Do not catch blind exception: Exception

(BLE001)


266-267: try-except-pass detected, consider logging the exception

(S110)


266-266: Do not catch blind exception: Exception

(BLE001)


271-271: Unused method argument: project_id

(ARG002)


274-274: Starting a process with a partial executable path

(S607)


325-325: Consider moving this statement to an else block

(TRY300)


326-326: Do not catch blind exception: Exception

(BLE001)


340-340: Consider moving this statement to an else block

(TRY300)


341-341: Do not catch blind exception: Exception

(BLE001)

tests/benchmarks/test_locomo.py

701-701: Unused method argument: question

(ARG002)


752-752: Local variable found_in_memory is assigned to but never used

Remove assignment to unused variable found_in_memory

(F841)


849-849: Consider moving this statement to an else block

(TRY300)


851-851: Do not catch blind exception: Exception

(BLE001)


935-935: Consider moving this statement to an else block

(TRY300)


937-937: Do not catch blind exception: Exception

(BLE001)


1083-1083: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


1084-1084: Unused method argument: category

(ARG002)


1438-1438: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

tests/benchmarks/experiments/experiment_runner.py

1-1: Shebang is present but file is not executable

(EXE001)


137-137: Do not catch blind exception: Exception

(BLE001)


288-288: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


289-289: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


339-339: Local variable worst is assigned to but never used

Remove assignment to unused variable worst

(F841)


382-382: f-string without any placeholders

Remove extraneous f prefix

(F541)


392-392: f-string without any placeholders

Remove extraneous f prefix

(F541)


399-399: f-string without any placeholders

Remove extraneous f prefix

(F541)


414-414: f-string without any placeholders

Remove extraneous f prefix

(F541)


425-425: f-string without any placeholders

Remove extraneous f prefix

(F541)


631-631: f-string without any placeholders

Remove extraneous f prefix

(F541)


649-649: Local variable results is assigned to but never used

Remove assignment to unused variable results

(F841)

🪛 Shellcheck (0.11.0)
tests/benchmarks/experiments/run_optimization.sh

[error] 143-143: -f doesn't work with globs. Use a for loop.

(SC2144)

🔇 Additional comments (18)
.gitignore (1)

25-26: LGTM!

The ignore pattern correctly prevents committing temporary experiment results generated by the benchmark suite, and the comment clarifies the intended workflow. The pattern syntax is consistent with existing rules in the file (e.g., line 17).

tests/test_api_endpoints.py (2)

16-16: LGTM: Good use of configuration constant.

Using config.VECTOR_SIZE instead of a hardcoded value improves maintainability and ensures the test stays synchronized with the actual configuration.

Also applies to: 918-918


619-637: LGTM: Test correctly validates OpenAI requirement for reembed.

The updated test expectations properly reflect that the reembed operation requires OpenAI to be available. The 503 status code and error message validation ensure appropriate failure handling when OpenAI is unavailable.

tests/benchmarks/REPRODUCIBILITY.md (1)

1-216: Well-structured reproducibility documentation.

The guide covers environment setup, multiple benchmark modes, result verification, and troubleshooting comprehensively. The citation guidance at the end is a nice touch for academic/comparative use.

tests/benchmarks/experiments/README.md (1)

1-228: Comprehensive experiment framework documentation.

The README thoroughly documents grid search, AI-guided optimization, ablation studies, and Railway deployment. The ASCII diagrams and example JSON help illustrate the workflow clearly.

Note: References to "GPT-5.1" (lines 48, 110, 194) appear to be placeholders for future models - ensure these are updated when actual models are available or clarify they're hypothetical.

tests/benchmarks/experiments/__init__.py (1)

1-21: LGTM - Clean package initialization.

The __init__.py properly establishes the public API surface with explicit __all__ declaration matching the imports.
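
For reference, the pattern being praised looks roughly like this, using the names listed in the code graph section above; the actual file may differ in detail.

```python
from .experiment_config import (
    ExperimentConfig,
    ExperimentResult,
    generate_experiment_grid,
    generate_focused_experiments,
)

__all__ = [
    "ExperimentConfig",
    "ExperimentResult",
    "generate_experiment_grid",
    "generate_focused_experiments",
]
```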

tests/benchmarks/COMPETITOR_ANALYSIS.md (1)

1-244: Excellent competitive analysis document.

The analysis clearly identifies the performance gap causes (fact extraction, temporal reasoning) and provides actionable implementation priorities. The phased approach (Quick Wins → Statement Extraction → Contradiction Resolution) is pragmatic.

tests/benchmarks/experiments/experiment_config.py (2)

254-260: Potential issue with dataclass field copying.

Using base_config.__dict__.copy() works for simple dataclasses but may not handle nested objects or default_factory fields correctly. The errors field in ExperimentResult uses field(default_factory=list), so ensure this pattern isn't applied there.

This is acceptable for ExperimentConfig since all fields are simple types. Just ensure generate_focused_experiments is only used with ExperimentConfig, not ExperimentResult.
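
To illustrate the fragility being flagged, here is a toy example. The init=False field is hypothetical and added only to show how the __dict__-spreading pattern breaks once such a field exists; this is not the real ExperimentConfig.

```python
from dataclasses import dataclass, field, replace

@dataclass
class ToyConfig:
    name: str
    vector_weight: float = 0.7
    graph_weight: float = field(init=False, default=0.3)  # hypothetical derived field

    def __post_init__(self):
        self.graph_weight = 1.0 - self.vector_weight

base = ToyConfig(name="baseline", vector_weight=0.6)

# dataclasses.replace() re-runs __init__/__post_init__ and recomputes the derived field.
mutant = replace(base, vector_weight=0.9)
print(mutant.graph_weight)  # ~0.1

# Spreading __dict__ also passes graph_weight, which the generated __init__ rejects.
try:
    ToyConfig(**base.__dict__)
except TypeError as exc:
    print("broken copy:", exc)
```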


39-93: LGTM - Well-structured configuration dataclass.

The ExperimentConfig class is well-designed with sensible defaults and the to_env_vars() method provides clean Railway integration.

tests/benchmarks/core_adapter/evaluate.js (1)

37-38: Model name "gpt-5.1" is valid and correctly used.

The default model configuration correctly references "gpt-5.1", which OpenAI released on November 12, 2025. No changes are needed.

tests/benchmarks/core_adapter/automem-search.js (2)

1-113: LGTM overall - clean adapter implementation.

The adapter correctly implements the CORE-compatible interface with appropriate timeout, auth headers, and format conversion. The structure is well-documented.


76-85: Defensive check for r.memory.content before accessing.

Line 82 accesses r.memory.content without verifying r.memory exists (unlike the filter at line 78 which safely uses optional chaining). If r.memory is undefined, this would throw.

       .filter((r) => {
         const tags = r.memory?.tags || [];
         return tags.some((t) => t.startsWith("entity:"));
       })
       .map((r) => ({
-        fact: r.memory.content,
-        validAt: r.memory.metadata?.session_datetime || null,
-        source: r.memory.id,
+        fact: r.memory?.content || "",
+        validAt: r.memory?.metadata?.session_datetime || null,
+        source: r.memory?.id,
       }));

Likely an incorrect or invalid review comment.

tests/benchmarks/locomo_metrics.py (1)

59-89: LGTM - F1 implementation matches official LoCoMo.

The token-level F1 with Porter stemming correctly implements the official metric from the LoCoMo paper.
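
For readers comparing against the paper, a minimal sketch of token-level F1 with Porter stemming; the normalization details here (lowercasing and punctuation stripping) are assumptions and may differ from locomo_metrics.f1_score in edge cases.

```python
import string
from collections import Counter

from nltk.stem import PorterStemmer

ps = PorterStemmer()

def _tokens(text: str) -> list:
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [ps.stem(tok) for tok in text.split()]

def f1(prediction: str, ground_truth: str) -> float:
    pred, gold = _tokens(prediction), _tokens(ground_truth)
    if not pred or not gold:
        return 0.0
    common = Counter(pred) & Counter(gold)  # multiset intersection of stemmed tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(f1("in Paris", "Paris"))  # 2 * 0.5 * 1.0 / 1.5 ≈ 0.667
```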

tests/benchmarks/experiments/railway_manager.py (2)

20-29: LGTM - RailwayInstance dataclass is well-structured.

Clean dataclass definition with appropriate fields for tracking deployment state.


82-108: No subprocess security issue here—arguments are already properly passed as a list.

The subprocess.run() call uses the list form without shell=True, so arguments bypass shell parsing entirely. Environment variables from config.to_env_vars() are safe even if they contain special characters, since ["railway", "variables", "set", f"{key}={value}"] passes the concatenated string directly to the Railway CLI without shell interpretation.
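
A small illustration of why the list form is safe, reusing the call shape quoted above; the helper name and example value are illustrative, and running it requires the Railway CLI on PATH.

```python
import subprocess

def set_railway_variable(key: str, value: str) -> None:
    # Each list element becomes one argv entry; nothing is parsed by a shell,
    # so ;, $, spaces, or quotes inside `value` cannot inject extra commands.
    subprocess.run(
        ["railway", "variables", "set", f"{key}={value}"],
        capture_output=True,
        timeout=60,
        check=False,
    )

# Even a hostile-looking value stays a single literal argument:
set_railway_variable("EXAMPLE_KEY", "abc; rm -rf /")  # argv[3] == "EXAMPLE_KEY=abc; rm -rf /"
```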

tests/benchmarks/test_locomo.py (1)

1656-1676: Same invalid model names in CLI defaults.

Lines 1659 and 1675 reference gpt-5.1 which doesn't exist.

     parser.add_argument(
         "--e2e-model",
-        default="gpt-5.1",
-        help="Model for E2E answer generation (default: gpt-5.1, options: gpt-4.1, gpt-4o-mini)",
+        default="gpt-4o",
+        help="Model for E2E answer generation (default: gpt-4o, options: gpt-4-turbo, gpt-4o-mini)",
     )
     # ...
     parser.add_argument(
         "--eval-judge-model",
-        default="gpt-5.1",
-        help="Model for lenient evaluation judging (default: gpt-5.1)",
+        default="gpt-4o",
+        help="Model for lenient evaluation judging (default: gpt-4o)",
     )
⛔ Skipped due to learnings
Learnt from: jack-arturo
Repo: verygoodplugins/automem PR: 19
File: tests/benchmarks/experiments/analysis_agent.py:69-69
Timestamp: 2025-12-08T19:21:25.845Z
Learning: OpenAI released GPT-5.1 on November 12, 2025. The model identifier "gpt-5.1" is valid for API usage in the automem codebase.
tests/benchmarks/experiments/analysis_agent.py (1)

69-69: Invalid model name "gpt-5.1".

Same issue as in test_locomo.py - gpt-5.1 doesn't exist.

     def __init__(
         self,
-        model: str = "gpt-5.1",
+        model: str = "gpt-4o",
         results_dir: str = "experiment_results",
⛔ Skipped due to learnings
Learnt from: jack-arturo
Repo: verygoodplugins/automem PR: 19
File: tests/benchmarks/experiments/analysis_agent.py:69-69
Timestamp: 2025-12-08T19:21:25.845Z
Learning: OpenAI released GPT-5.1 on November 12, 2025. The model identifier "gpt-5.1" is valid for API usage in the automem codebase.
tests/benchmarks/experiments/experiment_runner.py (1)

115-115: Use safe dictionary access to prevent KeyError.

If a conversation in the data file doesn't have a sample_id key, this will raise a KeyError.

-            conv_ids = [c.get('sample_id') for c in convs[:self.num_conversations]]
+            conv_ids = [c.get('sample_id') for c in convs[:self.num_conversations] if 'sample_id' in c]

Likely an incorrect or invalid review comment.

Comment on lines 11 to 12
# Benchmark evaluation (LoCoMo official metrics)
nltk==3.9.1

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

nltk 3.9.1 PyPI latest version

💡 Result:

No — the latest NLTK on PyPI is 3.9.2 (uploaded Oct 1, 2025). Version 3.9.1 was uploaded Aug 18, 2024. [1][2]

Sources:
[1] PyPI nltk project page (release files showing 3.9.2).
[2] PyPI mirror listing for nltk-3.9.1 (upload date Aug 18, 2024).


🌐 Web query:

NLTK Porter stemmer F1 scoring benchmark metrics

💡 Result:

Short answer: there is no single authoritative "NLTK Porter stemmer F1" value — F1 for Porter depends on the dataset and the evaluation method (what counts as a correct stem, downstream task labels, etc.). To measure it you typically (1) produce stemmed tokens for test and gold lists, (2) treat matching stems as positives and compute precision/recall/F1 (or measure downstream task performance with/without stemming). Use nltk.metrics.f_measure or scikit-learn's classification_report to compute F1. Examples and utilities: NLTK metrics docs (precision/recall/f_measure). [1] Comparative evaluations showing differing F1 results for Porter vs other stemmers depend on dataset/model (examples in published experiments). [2]

References
[1] NLTK — metrics how-to (precision/recall/f_measure usage).
[2] Example comparative evaluations reporting Porter F1s in task experiments (shows scores vary by dataset and model).


Update nltk to the latest version 3.9.2.

The pinned version nltk==3.9.1 (Aug 2024) is outdated. Version 3.9.2 was released on October 1, 2025 and is available. Update to nltk==3.9.2 for the latest features and patches.

🤖 Prompt for AI Agents
In requirements-dev.txt around lines 11 to 12, the file pins nltk to an older
patch release (nltk==3.9.1); update the requirement to nltk==3.9.2 by changing
the version string to the newer patch release so dependency installs use 3.9.2;
run pip-compile or your dependency update workflow and run a quick test suite or
import nltk locally to verify compatibility.

"version": "1.0.0",
"description": "Adapter to run CORE benchmark against AutoMem",
"type": "module",
"main": "index.js",

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find tests/benchmarks/core_adapter -type f -name "*.js" | sort

Repository: verygoodplugins/automem

Length of output: 156


🏁 Script executed:

fd --type f --full-path 'tests/benchmarks/core_adapter/index\.js$'

Repository: verygoodplugins/automem

Length of output: 49


Update the main entry point.

The main field references index.js, which does not exist in this directory. The only available JavaScript files are automem-search.js and evaluate.js. Update the main field to point to the appropriate entry point, or add an index.js file that exports the necessary modules.

🤖 Prompt for AI Agents
In tests/benchmarks/core_adapter/package.json around line 6, the "main" field
currently points to "index.js" which does not exist; update the "main" entry to
reference an existing JS entry (for example "automem-search.js" or
"evaluate.js") that should serve as the package entry point, or alternatively
add a new index.js that re-exports the appropriate module(s); make the change so
package.json's "main" matches an actual file and ensure any exported API matches
what consumers expect.


parser = argparse.ArgumentParser(description="AutoMem Autonomous Optimization Agent")
parser.add_argument("--iterations", type=int, default=10, help="Max iterations")
parser.add_argument("--model", type=str, default="gpt-5.1", help="LLM model for analysis")

⚠️ Potential issue | 🔴 Critical

Same invalid model in CLI default.

-    parser.add_argument("--model", type=str, default="gpt-5.1", help="LLM model for analysis")
+    parser.add_argument("--model", type=str, default="gpt-4o", help="LLM model for analysis")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
parser.add_argument("--model", type=str, default="gpt-5.1", help="LLM model for analysis")
parser.add_argument("--model", type=str, default="gpt-4o", help="LLM model for analysis")
🤖 Prompt for AI Agents
In tests/benchmarks/experiments/analysis_agent.py at line 342, the CLI default
model "gpt-5.1" is invalid; update the parser default to a supported model name
(for example "gpt-4" or your project's canonical default) or read the default
from the project config/ENV so it stays valid; change the argument to use that
supported model string and ensure any tests or docs referencing the old default
are updated accordingly.
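
A short sketch of the env-driven default the prompt suggests; the ANALYSIS_MODEL variable name and the gpt-4o fallback are assumptions, not an existing project convention:

import argparse
import os

parser = argparse.ArgumentParser(description="AutoMem Autonomous Optimization Agent")
parser.add_argument("--iterations", type=int, default=10, help="Max iterations")
parser.add_argument(
    "--model",
    type=str,
    # Fall back to a known-good model; operators can override via the environment.
    default=os.environ.get("ANALYSIS_MODEL", "gpt-4o"),
    help="LLM model for analysis (override with ANALYSIS_MODEL)",
)

This keeps the CLI usable even if a hard-coded default drifts out of date, since the operator can point ANALYSIS_MODEL at whatever model is currently supported.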

Comment on lines 211 to 243
async def destroy_instance(self, instance: RailwayInstance) -> bool:
"""Tear down a Railway instance"""
try:
if instance.environment_id == "local":
# Docker cleanup
subprocess.run(
["docker", "compose", "-p", instance.project_id, "down", "-v"],
capture_output=True,
timeout=60
)
else:
# Railway cleanup
subprocess.run(
["railway", "delete", "--yes"],
capture_output=True,
timeout=60
)

# Find and remove from active_instances (keyed by instance_name, not project_id)
key_to_delete = None
for key, inst in self.active_instances.items():
if inst.project_id == instance.project_id:
key_to_delete = key
break
if key_to_delete:
del self.active_instances[key_to_delete]

print(f"🗑️ Destroyed instance: {instance.project_id}")
return True

except Exception as e:
print(f"❌ Cleanup error: {e}")
return False

⚠️ Potential issue | 🟠 Major

Railway cleanup doesn't specify project context.

The railway delete command at line 224 doesn't specify which project to delete. This could delete the wrong project if the CLI context changed.

             else:
                 # Railway cleanup
                 subprocess.run(
-                    ["railway", "delete", "--yes"],
+                    ["railway", "delete", "--yes", "-p", instance.project_id],
                     capture_output=True,
                     timeout=60
                 )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def destroy_instance(self, instance: RailwayInstance) -> bool:
"""Tear down a Railway instance"""
try:
if instance.environment_id == "local":
# Docker cleanup
subprocess.run(
["docker", "compose", "-p", instance.project_id, "down", "-v"],
capture_output=True,
timeout=60
)
else:
# Railway cleanup
subprocess.run(
["railway", "delete", "--yes", "-p", instance.project_id],
capture_output=True,
timeout=60
)
# Find and remove from active_instances (keyed by instance_name, not project_id)
key_to_delete = None
for key, inst in self.active_instances.items():
if inst.project_id == instance.project_id:
key_to_delete = key
break
if key_to_delete:
del self.active_instances[key_to_delete]
print(f"🗑️ Destroyed instance: {instance.project_id}")
return True
except Exception as e:
print(f"❌ Cleanup error: {e}")
return False
🧰 Tools
🪛 Ruff (0.14.7)

216-216: subprocess call: check for execution of untrusted input (S603)
217-217: Starting a process with a partial executable path (S607)
224-224: Starting a process with a partial executable path (S607)
239-239: Consider moving this statement to an else block (TRY300)
241-241: Do not catch blind exception: Exception (BLE001)

🤖 Prompt for AI Agents
In tests/benchmarks/experiments/railway_manager.py around lines 211 to 243, the
call to `railway delete` does not specify which project to delete, risking
deleting the wrong project; update the subprocess call to pass the target
project explicitly (e.g., add the appropriate project flag/argument using
instance.project_id such as "--project" or equivalent supported by the Railway
CLI) so the command runs like: ["railway", "delete", "--yes", "--project",
instance.project_id] (or the CLI's correct project option), keep capture_output
and timeout, and ensure the rest of the cleanup code still removes the instance
from active_instances and logs the result.
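
Beyond passing the project flag, a belt-and-suspenders option is to refuse the destructive call when the manager cannot confirm which project the CLI is linked to. A rough sketch, where linked_project_id is an illustrative attribute the current class does not track:

import subprocess

def safe_railway_delete(manager, instance) -> bool:
    # Only delete when the manager recorded that it linked the CLI to this project.
    linked = getattr(manager, "linked_project_id", None)
    if linked != instance.project_id:
        print(f"⚠️ Refusing to delete: CLI context is {linked!r}, expected {instance.project_id!r}")
        return False
    subprocess.run(["railway", "delete", "--yes"], capture_output=True, timeout=60)
    return True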

Comment on lines 271 to 282
async def _get_deployment_url(self, project_id: str) -> str:
"""Get the public URL for a Railway deployment"""
result = subprocess.run(
["railway", "domain"],
capture_output=True,
text=True
)
# Parse URL from output
for line in result.stdout.split("\n"):
if "railway.app" in line or "up.railway.app" in line:
return f"https://{line.strip()}"
return ""

⚠️ Potential issue | 🟡 Minor

Unused project_id parameter in _get_deployment_url.

The method signature accepts project_id but doesn't use it. The railway domain command should probably be scoped to the project.

Either use the parameter or remove it:

-    async def _get_deployment_url(self, project_id: str) -> str:
+    async def _get_deployment_url(self, _project_id: str) -> str:
         """Get the public URL for a Railway deployment"""

Or scope the command:

             result = subprocess.run(
-                ["railway", "domain"],
+                ["railway", "domain", "-p", project_id],
                 capture_output=True,
                 text=True
             )
🧰 Tools
🪛 Ruff (0.14.7)

271-271: Unused method argument: project_id (ARG002)
274-274: Starting a process with a partial executable path (S607)

🤖 Prompt for AI Agents
In tests/benchmarks/experiments/railway_manager.py around lines 271-282, the
_get_deployment_url function declares a project_id parameter but never uses it;
update the subprocess.run invocation to scope the railway command to the
provided project (for example add the project flag like ["railway", "domain",
"--project", project_id]) so the domain lookup targets the correct project,
handle the case where project_id is empty or None by raising/returning early,
and keep the existing stdout parsing and return behavior unchanged.

Comment on lines +143 to +147
if [ -f "$OUTPUT_DIR/"*"_report.txt" ]; then
echo ""
echo -e "${BLUE}=== SUMMARY ===${NC}"
cat "$OUTPUT_DIR/"*"_report.txt" | head -30
fi

⚠️ Potential issue | 🟡 Minor

Fix the file existence check with glob pattern.

The -f test doesn't work correctly with glob patterns. Use a for loop or other approach to check for matching files.

Apply this diff to properly check for report files:

     # Show summary
-    if [ -f "$OUTPUT_DIR/"*"_report.txt" ]; then
+    report_file=$(find "$OUTPUT_DIR" -name "*_report.txt" -type f | head -1)
+    if [ -n "$report_file" ]; then
         echo ""
         echo -e "${BLUE}=== SUMMARY ===${NC}"
-        cat "$OUTPUT_DIR/"*"_report.txt" | head -30
+        head -30 "$report_file"
     fi

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 143-143: -f doesn't work with globs. Use a for loop. (SC2144)

🤖 Prompt for AI Agents
In tests/benchmarks/experiments/run_optimization.sh around lines 143-147, the
current if [ -f "$OUTPUT_DIR/"*"_report.txt" ] check incorrectly tests a glob
pattern; replace it with logic that expands the glob and verifies an actual file
exists (e.g., iterate over "$OUTPUT_DIR/"*"_report.txt" and on the first
existing file run the summary block, or use a glob-existence test such as
compgen -G to detect matches), then cat the matched file(s) as before; ensure
the loop or test short-circuits after finding no matches to preserve original
behavior.

Comment on lines +106 to +110
resp = requests.get(
f"{BASE_URL}/recall",
headers=headers,
params={"query": question, "tags": test_tag, "limit": 5}
)

⚠️ Potential issue | 🟡 Minor

Add timeout to requests.get call.

The recall request is missing a timeout, which could cause the test to hang indefinitely.

         resp = requests.get(
             f"{BASE_URL}/recall",
             headers=headers,
-            params={"query": question, "tags": test_tag, "limit": 5}
+            params={"query": question, "tags": test_tag, "limit": 5},
+            timeout=10
         )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
resp = requests.get(
f"{BASE_URL}/recall",
headers=headers,
params={"query": question, "tags": test_tag, "limit": 5}
)
resp = requests.get(
f"{BASE_URL}/recall",
headers=headers,
params={"query": question, "tags": test_tag, "limit": 5},
timeout=10
)
🧰 Tools
🪛 Ruff (0.14.7)

106-106: Probable use of requests call without timeout (S113)

🤖 Prompt for AI Agents
In tests/benchmarks/experiments/test_e2e_mini.py around lines 106 to 110, the
requests.get call for the /recall endpoint lacks a timeout which can cause the
test to hang; add a timeout parameter (e.g. timeout=10) to the requests.get call
(or use a project-configured constant) so the request fails fast on network
issues, and adjust any test expectations or exception handling if necessary to
accommodate timeout exceptions.

Comment on lines +159 to +164
resp = requests.delete(
f"{BASE_URL}/admin/memories",
headers={"Authorization": f"Bearer admin"},
params={"tag": test_tag, "confirm": "yes"}
)
print(f" Cleanup: {resp.status_code}")

⚠️ Potential issue | 🟡 Minor

Add timeout to cleanup request and verify auth token usage.

The delete request is missing a timeout. Also, the cleanup uses a hardcoded "admin" token instead of API_TOKEN - verify this is intentional for the admin endpoint.

     resp = requests.delete(
         f"{BASE_URL}/admin/memories",
         headers={"Authorization": f"Bearer admin"},
-        params={"tag": test_tag, "confirm": "yes"}
+        params={"tag": test_tag, "confirm": "yes"},
+        timeout=10
     )
🧰 Tools
🪛 Ruff (0.14.7)

159-159: Probable use of requests call without timeout (S113)
161-161: f-string without any placeholders; remove extraneous f prefix (F541)

🤖 Prompt for AI Agents
In tests/benchmarks/experiments/test_e2e_mini.py around lines 159 to 164, the
cleanup DELETE call is missing a timeout and uses a hardcoded "admin" token;
update the requests.delete invocation to include a timeout (e.g., timeout=10)
and replace the hardcoded Authorization value with the shared API_TOKEN variable
(headers={"Authorization": f"Bearer {API_TOKEN}"}) unless the test truly
requires a separate admin token—if a dedicated admin token is required, document
that intent and reference a named ADMIN_TOKEN constant instead of the literal
"admin".

Comment on lines +159 to +166
# Category 3 (multi-hop): Use first answer only (before semicolon)
if category == 3:
ground_truth = ground_truth.split(";")[0].strip()

# Category 1 (single-hop with multiple sub-answers): Use multi-answer F1
if category == 1:
score = f1_multi_answer(prediction, ground_truth)
return score, "f1_multi_answer"

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd tests/benchmarks && cat -n locomo_metrics.py | head -200

Repository: verygoodplugins/automem

Length of output: 7344


🏁 Script executed:

# Search for the function containing this code and its docstring
rg -n "def.*:" tests/benchmarks/locomo_metrics.py -A 50 | head -150

Repository: verygoodplugins/automem

Length of output: 5148


🏁 Script executed:

# Look for category definitions or documentation
rg -n "Category|category" tests/benchmarks/locomo_metrics.py -B 2 -A 2

Repository: verygoodplugins/automem

Length of output: 4474


🌐 Web query:

LoCoMo paper ACL 2024 category definitions single-hop multi-hop

💡 Result:

Single-hop: question answerable using information contained in a single session. [1]

Multi-hop: question requiring synthesis of information from multiple different sessions. [1]

(These definitions are given in LoCoMo Section 4.1; see the LoCoMo repo/paper release and accompanying description.) [1][2]

Sources:
[1] snap-research/locomo (ACL 2024 LoCoMo repo & paper).
[2] MemMachine summary of LoCoMo category descriptions.


🏁 Script executed:

# Check if there are any tests or other files that clarify category definitions
rg -n "category.*1|category.*3" tests/ --type py -i | grep -E "(single|multi|hop|recall)" | head -20

Repository: verygoodplugins/automem

Length of output: 1347


🏁 Script executed:

# Look for any documentation or README files that explain the categories
find . -name "README*" -o -name "*.md" | xargs grep -l "category\|locomo" 2>/dev/null | head -5

Repository: verygoodplugins/automem

Length of output: 166


Fix incorrect docstring for f1_multi_answer function.

The docstring at line 94 incorrectly describes f1_multi_answer as handling "multi-hop questions (category 1)". Category 1 is actually single-hop recall with multiple sub-answers (comma-separated), while Category 3 is multi-hop (semicolon-separated). The docstring should read: "Official LoCoMo F1 for single-hop questions with multiple sub-answers (category 1)."

🤖 Prompt for AI Agents
In tests/benchmarks/locomo_metrics.py around line 94, the docstring for the
f1_multi_answer function incorrectly says it handles "multi-hop questions
(category 1)"; update the docstring to read: "Official LoCoMo F1 for single-hop
questions with multiple sub-answers (category 1)." Replace the erroneous phrase
so the docstring correctly describes category 1 as single-hop recall with
comma-separated sub-answers.
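
For readers following the category conventions above, here is a rough sketch of a category-aware scorer along these lines; the comma split for category 1 sub-answers and the per-sub-answer averaging are assumptions for illustration, not the official LoCoMo implementation:

import re
from collections import Counter

def _tokens(text: str) -> list[str]:
    # Lowercase, strip punctuation, split on whitespace (SQuAD-style normalization).
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def token_f1(prediction: str, gold: str) -> float:
    pred, ref = _tokens(prediction), _tokens(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def score_answer(prediction: str, ground_truth: str, category: int) -> float:
    if category == 3:  # multi-hop: keep only the first answer, before the semicolon
        ground_truth = ground_truth.split(";")[0].strip()
    if category == 1:  # single-hop with comma-separated sub-answers
        subs = [s.strip() for s in ground_truth.split(",") if s.strip()]
        if subs:
            return sum(token_f1(prediction, s) for s in subs) / len(subs)
    return token_f1(prediction, ground_truth)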

Comment on lines +20 to +22
# Clone repository
git clone https://github.com/your-org/automem.git
cd automem

⚠️ Potential issue | 🟡 Minor

Update placeholder repository URL.

The git clone URL uses a placeholder that should be updated to the actual repository URL.

 # Clone repository
-git clone https://github.com/your-org/automem.git
+git clone https://github.com/verygoodplugins/automem.git
 cd automem
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Clone repository
git clone https://github.com/your-org/automem.git
cd automem
# Clone repository
git clone https://github.com/verygoodplugins/automem.git
cd automem
🤖 Prompt for AI Agents
In tests/benchmarks/REPRODUCIBILITY.md around lines 20–22 the git clone command
uses a placeholder URL (https://github.com/your-org/automem.git); replace it
with the actual repository clone URL for this project (e.g. the real GitHub
org/user and repo name or the SSH clone URL) so the command works for
contributors, and verify the URL is reachable and correct before committing.

- Add comprehensive LoCoMo test harness with multiple evaluation modes
- Add CORE adapter for direct comparison with CORE benchmark
- Add experiment runner framework for parameter optimization
- Add analysis agent for automated result interpretation
- Add benchmark documentation (analysis, competitor comparison, reproducibility)
- Add convenience scripts for running benchmarks

This provides a complete benchmarking infrastructure for validating
AutoMem's performance against academic baselines.