Cleanup/benchmark #20
base: main
Conversation
Important: Review skipped. Auto incremental reviews are disabled on this repository; check the settings in the CodeRabbit UI. You can disable this status message via the CodeRabbit configuration.

📝 Walkthrough

This PR adds a comprehensive benchmarking and experiment framework for AutoMem's LoCoMo evaluation. It introduces official LoCoMo metrics, a CORE-compatible evaluation adapter, an experiment configuration system with Railway-based deployment, an autonomous analysis agent for optimization, and various benchmark utilities. The framework supports multiple evaluation modes, including end-to-end generation and lenient semantic scoring.

Changes
Sequence Diagrams

```mermaid
sequenceDiagram
    actor User
    participant Script as run_all_benchmarks.sh
    participant Python as test_locomo.py
    participant AutoMem
    participant OpenAI
    participant Metrics as locomo_metrics.py
    User->>Script: Execute benchmark
    Script->>Script: Verify AutoMem healthy
    Script->>Python: Run default mode
    loop Per Conversation/Question
        Python->>AutoMem: POST /memory (store)
        Python->>AutoMem: GET /recall (retrieve)
        Python->>OpenAI: Generate answer (e2e mode)
        Python->>Metrics: Score answer (official F1)
    end
    Python->>Script: Return results (JSON)
    Script->>Script: Display summary
```
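For orientation, here is a minimal Python sketch of the store/recall loop shown in the diagram above. The `GET /recall` parameters mirror the test code in this PR; the `POST /memory` payload fields, base URL, token variable names, and response key are illustrative assumptions rather than the benchmark's actual implementation.

```python
import os

import requests

BASE_URL = os.environ.get("AUTOMEM_URL", "http://localhost:8001")  # assumed default
HEADERS = {"Authorization": f"Bearer {os.environ.get('AUTOMEM_API_TOKEN', '')}"}


def store_turn(content: str, tags: list[str]) -> None:
    """Store one conversation turn as a memory (payload shape is assumed)."""
    resp = requests.post(
        f"{BASE_URL}/memory",
        headers=HEADERS,
        json={"content": content, "tags": tags},
        timeout=10,
    )
    resp.raise_for_status()


def recall(question: str, tag: str, limit: int = 5) -> list[dict]:
    """Retrieve candidate memories for a question; params mirror the PR's test code."""
    resp = requests.get(
        f"{BASE_URL}/recall",
        headers=HEADERS,
        params={"query": question, "tags": tag, "limit": limit},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])  # response key is an assumption
```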
```mermaid
sequenceDiagram
    actor User
    participant Runner as ExperimentRunner
    participant Config as ExperimentConfig
    participant Railway as RailwayManager
    participant Evaluator as LoCoMoEvaluator
    participant Agent as AnalysisAgent
    User->>Runner: run_grid_search(configs)
    loop Per Config
        Runner->>Railway: deploy_instance(config)
        Railway->>Railway: Wait for healthy
        Runner->>Evaluator: run_single_experiment()
        Evaluator->>Evaluator: Evaluate & score
        Runner->>Runner: Save result
    end
    Runner->>Agent: analyze_and_decide(results)
    Agent->>Agent: Format results & LLM prompt
    Agent->>Agent: Parse decision (continue/stop)
    alt Continue
        Agent->>Config: Generate next configs
        Agent->>Runner: run_grid_search(next_configs)
    else Stop/Converged
        Agent->>Agent: Save final report
    end
```
```mermaid
sequenceDiagram
    participant Client as CORE Evaluator
    participant Adapter as automem-search.js
    participant AutoMem
    participant OpenAI as OpenAI API
    participant Metrics as evaluate.js
    Client->>Adapter: search(query, userId)
    Adapter->>AutoMem: GET /recall
    AutoMem->>Adapter: Return memories
    Adapter->>Adapter: Format to episodes/facts
    Adapter->>Client: Return results
    Client->>OpenAI: Generate answer (with context)
    OpenAI->>Client: Return answer
    Client->>Metrics: Evaluate answer
    Metrics->>Metrics: Compute correctness label
    Metrics->>Client: Return CORRECT/WRONG
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:
Possibly related PRs
Suggested labels
Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 10
🧹 Nitpick comments (36)
tests/test_embedding_providers.py (1)
545-554: Tighten env isolation to always exercise the auto-selection fallback path

The new block correctly ensures OpenAI is skipped and FastEmbed fails so that the placeholder is selected. However, the test still depends on the outer environment's `EMBEDDING_PROVIDER` value: if it is set to `"openai"` or `"local"`, `init_embedding_provider` will raise instead of falling back to the placeholder.

To make the test fully deterministic and aligned with the documented provider selection behavior, explicitly force the auto path here, e.g.:

```python
with patch.dict(
    os.environ,
    {"EMBEDDING_PROVIDER": "auto", "OPENAI_API_KEY": ""},
    clear=False,
):
    with patch("automem.embedding.fastembed.TextEmbedding", side_effect=failing_init):
        ...
```

Optionally, since you no longer use the `monkeypatch` fixture in this test, you can drop it from the function signature or switch to `monkeypatch.setenv` for consistency with other tests. Based on learnings, `EMBEDDING_PROVIDER` is the primary switch for provider behavior.

tests/benchmarks/COMPETITOR_ANALYSIS.md (1)
30-32: Consider adding language specifiers to code blocks.

The ASCII diagrams would benefit from a language specifier like `text` or `plaintext` for consistency and to satisfy linters.

````diff
-```
+```text
 Episode → Normalization → Extraction → Resolution → Graph Integration
````

tests/benchmarks/experiments/test_e2e_mini.py (3)

44-44: Remove extraneous f-prefix from strings without placeholders.

Several print statements use f-strings without any placeholders.

```diff
- print(f"\n[1/4] Storing memories...")
+ print("\n[1/4] Storing memories...")
```

Similar changes needed at lines 91, 158.
100-100: Remove unused variable `category`.

The `category` variable is assigned but never used.

```diff
- category = qa.get('category', 1)
+ # category = qa.get('category', 1)  # Available for future use
```

Or simply remove the line if not needed.

77-78: Consider catching more specific exceptions.

Broad `Exception` catches can mask unexpected errors. Consider catching `requests.RequestException` for HTTP-related failures.

```diff
- except Exception as e:
+ except requests.RequestException as e:
      print(f"  Error storing: {e}")
```

tests/benchmarks/experiments/experiment_config.py (3)
10-10: Remove unused import.

`itertools` is imported but never used in this file.

```diff
-import itertools
```

14-36: Consider using the defined Enums in ExperimentConfig.

The `EmbeddingModel`, `EnrichmentModel`, and `RecallStrategy` enums are defined but not used; `ExperimentConfig` uses plain `str` types instead. Either use the enums for type safety or remove them to avoid confusion.

Option 1, use enums (provides validation):

```python
embedding_model: EmbeddingModel = EmbeddingModel.OPENAI_SMALL
```

Option 2, keep strings but remove the unused enums if they're not needed elsewhere.

126-152: Guard against missing keys in custom weights.

If a caller provides a partial `weights` dict, `score()` will raise `KeyError`. Consider using `.get()` with defaults or validating the weights dict.

```diff
 def score(self, weights: Optional[Dict[str, float]] = None) -> float:
-    """
-    Calculate weighted score for this configuration.
-    Higher is better.
-    """
-    if weights is None:
-        weights = {
+    """Calculate weighted score for this configuration. Higher is better."""
+    default_weights = {
         "accuracy": 0.4,
         "response_time": 0.2,
-        "multi_hop": 0.2,  # Emphasize our weakness
+        "multi_hop": 0.2,
         "cost": 0.1,
         "temporal": 0.1,
     }
+    if weights is None:
+        weights = default_weights
+    else:
+        # Merge with defaults to handle partial weights
+        weights = {**default_weights, **weights}
```

tests/benchmarks/core_adapter/README.md (1)
8-8: Format the URL as a markdown link.

The bare URL should be formatted as a proper markdown link for better rendering and accessibility.

Apply this diff:

```diff
-CORE (https://github.com/RedPlanetHQ/core-benchmark) claims:
+[CORE](https://github.com/RedPlanetHQ/core-benchmark) claims:
```

tests/benchmarks/experiments/test_minimal.py (2)
1-1: Consider making the file executable or removing the shebang.

The shebang is present but the file lacks execute permissions. Either add execute permissions with `chmod +x` or remove the shebang if the script is not intended to be executed directly.

121-127: Unused unpacked variable.

The `explanation` variable is unpacked but never used. Consider using `_` to indicate it's intentionally ignored.

Apply this diff:

```diff
- is_correct, confidence, explanation = evaluator._evaluate_lenient_semantic(
+ is_correct, confidence, _ = evaluator._evaluate_lenient_semantic(
      question, expected, generated, category
  )
  else:
-     is_correct, confidence, explanation = evaluator.check_answer_in_memories(
+     is_correct, confidence, _ = evaluator.check_answer_in_memories(
          question, expected, recalled, category
      )
```

tests/benchmarks/core_adapter/evaluate.js (1)
244-244: Clarify the context slice limit.

The search limit is set to 20 episodes (line 32), but only the first 10 are used for context (line 244). Consider documenting why this discrepancy exists or aligning the limits.

If the 10-episode limit is intentional for context window management, add a comment:

```diff
- const context = searchResults.episodes.slice(0, 10).join("\n\n");
+ // Limit to 10 episodes for context window management (even though we retrieve 20)
+ const context = searchResults.episodes.slice(0, 10).join("\n\n");
```

tests/benchmarks/core_adapter/automem-search.js (1)
43-97: Consider logging more details on search failures for debugging.

The error handler at line 94 only logs `error.message`, which may omit useful debugging info like HTTP status codes or response bodies. For a benchmark adapter, this could make diagnosing failures difficult.

```diff
  } catch (error) {
-   console.error("AutoMem search error:", error.message);
+   console.error("AutoMem search error:", error.message, error.response?.status, error.response?.data);
    return { episodes: [], facts: [] };
  }
```

tests/benchmarks/test_locomo.py (5)
738-767: Unused variable `found_in_memory` should be removed.

The variable `found_in_memory` is assigned at line 752 but never used. This is flagged by static analysis.

```diff
  if overlap_ratio > max_overlap:
      max_overlap = overlap_ratio
-     found_in_memory = mem.get("id")
```

1077-1085: Use explicit `Optional` type hints per PEP 484.

The type hints on lines 1083-1084 use implicit `None` defaults without `Optional`. Per coding guidelines and PEP 484, use explicit `Optional[T]` or `T | None`.

```diff
 def check_answer_in_memories(
     self,
     question: str,
     expected_answer: Any,
     recalled_memories: List[Dict[str, Any]],
-    evidence_dialog_ids: List[str] = None,
-    sample_id: str = None,
+    evidence_dialog_ids: Optional[List[str]] = None,
+    sample_id: Optional[str] = None,
     category: int = 0,
 ) -> Tuple[bool, float, str]:
```

Note: You'll need to add `Optional` to the imports from `typing`.

1438-1438: Use explicit `Optional` for the `conversation_ids` parameter.

Same PEP 484 issue; use `Optional[List[str]]` instead of an implicit `None`.

```diff
- def run_benchmark(self, cleanup_after: bool = True, conversation_ids: List[str] = None) -> Dict[str, Any]:
+ def run_benchmark(self, cleanup_after: bool = True, conversation_ids: Optional[List[str]] = None) -> Dict[str, Any]:
```

699-728: Consider using the `question` parameter for improved adversarial detection.

The `question` parameter is unused (static analysis ARG002). While the current implementation works, the question context could improve adversarial detection accuracy (e.g., checking if question keywords appear in memories).

If intentionally unused for interface consistency, consider prefixing with underscore:

```diff
 def _evaluate_adversarial(
     self,
-    question: str,
+    _question: str,  # Reserved for future use
     adversarial_answer: str,
     recalled_memories: List[Dict[str, Any]],
 ) -> Tuple[bool, float, str]:
```

851-854: Consider moving the return statement to an else block for cleaner control flow.

Per static analysis hint TRY300, the return at line 849 could be in an else block for clearer success/error separation.

```diff
 try:
     response = self.openai_client.chat.completions.create(
         model=self.config.e2e_model,
         # ...
     )
     answer = response.choices[0].message.content.strip()
     self.e2e_cache[cache_key] = answer
-    return answer
-
-except Exception as e:
+except Exception as e:
     print(f"⚠️ E2E generation error: {e}")
     return "error generating answer"
+else:
+    return answer
```

tests/benchmarks/locomo_metrics.py (3)
24-28: NLTK download may fail silently in CI/production environments.

The `LookupError` catch downloads `punkt`, but `PorterStemmer` doesn't require `punkt`; it's part of `nltk.stem`. The download might be unnecessary, and network-dependent initialization could fail in restricted environments.

```diff
 # Initialize stemmer
-try:
-    ps = PorterStemmer()
-except LookupError:
-    nltk.download("punkt")
-    ps = PorterStemmer()
+ps = PorterStemmer()  # PorterStemmer doesn't require data downloads
```

If you do need NLTK data for tokenization elsewhere, consider downloading during setup/install rather than at import time.

173-208: Remove the unused `question` parameter or document its purpose.

The `question` parameter is never used in the function body. Either remove it or add a leading underscore to indicate it's intentionally unused (e.g., for API consistency).

```diff
 def extract_answer_from_context(
-    memories: List[str], question: str, answer_hint: str = None
+    memories: List[str], _question: str, answer_hint: Optional[str] = None
 ) -> str:
```

Also, use explicit `Optional[str]` for `answer_hint` per PEP 484.

358-364: Unused loop variable `cat`; use an underscore prefix or use it.

The loop iterates over `summary["categories"].items()` but only uses `data`, not `cat`.

```diff
- for cat, data in summary["categories"].items():
+ for _cat, data in summary["categories"].items():
      print(
          f"  {data['name']:35s}: "
```

tests/benchmarks/experiments/analysis_agent.py (4)
259-265: Remove extraneous `f` prefix from strings without placeholders.

Lines 259, 262, and 295 have f-strings without any placeholders.

```diff
- print(f"\n🤔 Analyzing results...")
+ print("\n🤔 Analyzing results...")
  decision = await self.analyze_and_decide(all_results)
- print(f"\n📋 Agent Decision:")
+ print("\n📋 Agent Decision:")
  print(f"   Continue: {decision.get('continue_exploring', False)}")
```

And at line 295:

```diff
- print(f"\nFinal Config:")
+ print("\nFinal Config:")
```

354-362: Unused `best_config` variable.

The result of `run_autonomous_optimization` is assigned to `best_config` but never used afterward.

```diff
- best_config = await agent.run_autonomous_optimization(runner=runner)
+ await agent.run_autonomous_optimization(runner=runner)
  print("\n✅ Optimization complete!")
- print(f"Best configuration saved to: {args.output_dir}")
+ print(f"Best configuration saved to: {args.output_dir}/")
```

143-150: Broad exception catch loses error context.

Catching bare `Exception` and returning a fallback dict hides the error type. Consider logging the full traceback for debugging.

```diff
+ import traceback
  except Exception as e:
-     print(f"❌ Analysis failed: {e}")
+     print(f"❌ Analysis failed: {e}\n{traceback.format_exc()}")
      return {
          "continue_exploring": False,
```

1-1: Shebang without executable permission.

The file has a shebang (`#!/usr/bin/env python3`) but isn't marked executable. Either remove the shebang or make the file executable.

Run `chmod +x tests/benchmarks/experiments/analysis_agent.py` or remove line 1.

tests/benchmarks/experiments/railway_manager.py (3)
91-92: Hardcoded `/tmp` path is insecure and non-portable.

Using `/tmp` directly can lead to symlink attacks or race conditions. Use the `tempfile` module for secure temporary file handling.

```diff
+import tempfile

 # In deploy_instance:
-    cwd="/tmp"
+    cwd=tempfile.gettempdir()
```

166-169: Use `tempfile.NamedTemporaryFile` for secure env file creation.

Writing to `/tmp/{instance_name}.env` with predictable names is vulnerable to race conditions.

```diff
+import tempfile

- env_file = f"/tmp/{instance_name}.env"
- with open(env_file, "w") as f:
+ with tempfile.NamedTemporaryFile(mode="w", suffix=".env", delete=False) as f:
+     env_file = f.name
      for key, value in env_vars.items():
          f.write(f"{key}={value}\n")
```

266-268: Silent exception swallowing hides health check failures.

The `try-except-pass` pattern makes debugging impossible when health checks fail unexpectedly. At minimum, log the error.

```diff
  except Exception:
-     pass
+     pass  # Connection refused or timeout is expected during startup
  await asyncio.sleep(5)
```

Or consider logging at debug level:

```diff
+import logging
+logger = logging.getLogger(__name__)

  except Exception as e:
+     logger.debug(f"Health check pending: {e}")
      pass
```

tests/benchmarks/experiments/experiment_runner.py (8)
110-110: Remove redundant import.

The `json` module is already imported at line 15. This local import on line 110 is unnecessary.

```diff
- import json
  with open(locomo_config.data_file, 'r') as f:
```

292-294: Use proper dataclass copying instead of `__dict__`.

Using `__dict__.copy()` to copy the config is fragile and may not handle nested objects or future dataclass fields correctly. If `ExperimentConfig` is a dataclass (which seems likely from the usage), use `dataclasses.replace()` or proper copying.

```diff
+from dataclasses import replace
+
 # Create mutated config
- config_dict = config.__dict__.copy()
- config_dict["name"] = f"{config.name}_mut{i}_{param}"
- config_dict[param] = value
+ kwargs = {
+     "name": f"{config.name}_mut{i}_{param}",
+     param: value
+ }

 # Keep weights normalized
 if param in ["vector_weight", "graph_weight"]:
     other = "graph_weight" if param == "vector_weight" else "vector_weight"
-    config_dict[other] = 1.0 - value
+    kwargs[other] = 1.0 - value

- mutations.append(ExperimentConfig(**config_dict))
+ mutations.append(replace(config, **kwargs))
```

297-299: Weight normalization logic may not preserve the constraint.

If the randomly selected parameter to mutate happens to be a weight parameter, the logic normalizes the complementary weight. However, this assumes vector_weight + graph_weight = 1.0. If the original config doesn't satisfy this constraint, or if other code allows different totals, this could introduce inconsistency.

Consider adding an assertion or validation to ensure the weight constraint is maintained:

```diff
 # Keep weights normalized
 if param in ["vector_weight", "graph_weight"]:
     other = "graph_weight" if param == "vector_weight" else "vector_weight"
     config_dict[other] = 1.0 - value
+    # Validate constraint
+    assert abs(config_dict["vector_weight"] + config_dict["graph_weight"] - 1.0) < 0.001
```

339-339: Remove unused variable `worst`.

The variable `worst` is assigned but never used in the report generation.

```diff
  sorted_results = sorted(self.results, key=lambda r: r.score(), reverse=True)
  best = sorted_results[0]
- worst = sorted_results[-1]
  baseline = next((r for r in self.results if r.config.name == "baseline"), sorted_results[-1])
```

382-382: Remove unnecessary f-string prefixes.

Several f-strings don't contain any placeholders and should use regular strings instead.

```diff
- report.append(f"\n2. Multi-hop Reasoning (our weakness):")
+ report.append("\n2. Multi-hop Reasoning (our weakness):")
- report.append(f"\n3. Temporal Understanding:")
+ report.append("\n3. Temporal Understanding:")
- report.append(f"\n4. Response Times:")
+ report.append("\n4. Response Times:")
- report.append(f"\nEmbedding Model Impact:")
+ report.append("\nEmbedding Model Impact:")
- report.append(f"\nFact Extraction Impact:")
+ report.append("\nFact Extraction Impact:")
```

Also applies to: 392-392, 399-399, 414-414, 425-425

631-631: Remove unnecessary f-string prefix.

The f-string on line 631 doesn't contain any placeholders.

```diff
- print(f"\n🎯 Optimal configuration found:")
+ print("\n🎯 Optimal configuration found:")
```

649-649: Remove or use the `results` variable in ablation mode.

In ablation mode, the `results` variable is assigned but never used. Either use it to generate specific ablation insights or remove the assignment.

```diff
- results = await runner.run_ablation_study(args.param, values)
+ await runner.run_ablation_study(args.param, values)
```

Or if you want to use the results:

```diff
  results = await runner.run_ablation_study(args.param, values)
+ # Ablation-specific analysis could go here
+ print(f"\n📊 Ablation study completed with {len(results)} variants")
```

655-662: Railway cleanup logic is correct but could be clarified.

The Railway cleanup appears in both the `except KeyboardInterrupt` block and the `finally` block. While this ensures cleanup happens regardless of how the program exits, calling `destroy_all()` twice could be wasteful. Consider using a flag to track whether cleanup has already occurred, or rely solely on the `finally` block.

```diff
 except KeyboardInterrupt:
     print("\n⚠️ Interrupted! Cleaning up...")
-    if args.railway:
-        await runner.instance_manager.destroy_all()
 finally:
     if args.railway:
         await runner.instance_manager.destroy_all()
```

The `finally` block will execute even after the `except` block, so removing the duplicate call from the except handler should be sufficient. The print statement can remain to inform the user.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
tests/benchmarks/core_adapter/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (24)
.gitignore (1 hunks)
requirements-dev.txt (1 hunks)
tests/benchmarks/BENCHMARK_ANALYSIS_2025-12-03.md (1 hunks)
tests/benchmarks/COMPETITOR_ANALYSIS.md (1 hunks)
tests/benchmarks/REPRODUCIBILITY.md (1 hunks)
tests/benchmarks/core_adapter/README.md (1 hunks)
tests/benchmarks/core_adapter/automem-search.js (1 hunks)
tests/benchmarks/core_adapter/evaluate.js (1 hunks)
tests/benchmarks/core_adapter/package.json (1 hunks)
tests/benchmarks/experiments/README.md (1 hunks)
tests/benchmarks/experiments/__init__.py (1 hunks)
tests/benchmarks/experiments/analysis_agent.py (1 hunks)
tests/benchmarks/experiments/experiment_config.py (1 hunks)
tests/benchmarks/experiments/experiment_runner.py (1 hunks)
tests/benchmarks/experiments/railway_manager.py (1 hunks)
tests/benchmarks/experiments/run_optimization.sh (1 hunks)
tests/benchmarks/experiments/test_e2e_mini.py (1 hunks)
tests/benchmarks/experiments/test_minimal.py (1 hunks)
tests/benchmarks/locomo_metrics.py (1 hunks)
tests/benchmarks/run_all_benchmarks.sh (1 hunks)
tests/benchmarks/run_lenient_benchmark.sh (1 hunks)
tests/benchmarks/test_locomo.py (16 hunks)
tests/test_api_endpoints.py (5 hunks)
tests/test_embedding_providers.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use pytest with DummyGraph fixture to mock FalkorDB operations for unit testing
Files:
tests/benchmarks/experiments/test_e2e_mini.py
tests/test_api_endpoints.py
tests/test_embedding_providers.py
tests/benchmarks/experiments/__init__.py
tests/benchmarks/experiments/test_minimal.py
tests/benchmarks/experiments/analysis_agent.py
tests/benchmarks/locomo_metrics.py
tests/benchmarks/experiments/experiment_config.py
tests/benchmarks/experiments/railway_manager.py
tests/benchmarks/test_locomo.py
tests/benchmarks/experiments/experiment_runner.py
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use black for Python code formatting
Use flake8 for Python linting
**/*.py: Use Python with type hints. Indent 4 spaces; line length 100 (Black).
Use snake_case for module and function names in Python.
Use PascalCase for class names in Python.
Use UPPER_SNAKE_CASE for constants in Python.
Format code with Black (line length 100) and organize imports with Isort (profile=black).
Lint with Flake8 and ensure CI passes before committing.
Add or adjust tests for new endpoints, stores, or utils; maintain comprehensive test coverage.
Files:
tests/benchmarks/experiments/test_e2e_mini.py
tests/test_api_endpoints.py
tests/test_embedding_providers.py
tests/benchmarks/experiments/__init__.py
tests/benchmarks/experiments/test_minimal.py
tests/benchmarks/experiments/analysis_agent.py
tests/benchmarks/locomo_metrics.py
tests/benchmarks/experiments/experiment_config.py
tests/benchmarks/experiments/railway_manager.py
tests/benchmarks/test_locomo.py
tests/benchmarks/experiments/experiment_runner.py
tests/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/test_*.py: Place tests in the tests/ directory named test_*.py using the Pytest framework.
Prefer fixtures over globals in Pytest tests.
Files:
tests/test_api_endpoints.py
tests/test_embedding_providers.py
🧠 Learnings (8)
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: Applies to **/*.py : Add or adjust tests for new endpoints, stores, or utils; maintain comprehensive test coverage.
Applied to files:
tests/benchmarks/experiments/test_e2e_mini.py
tests/benchmarks/experiments/test_minimal.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to app.py : Implement 13 API endpoints covering core memory operations (POST /memory, GET /recall, PATCH /memory/<id>, DELETE /memory/<id>, GET /memory/by-tag), relationship management (POST /associate), consolidation (POST /consolidate, GET /consolidate/status), health checks (GET /health), enrichment (GET /enrichment/status, POST /enrichment/reprocess), analysis (GET /analyze), and startup recall (GET /startup-recall)
Applied to files:
tests/benchmarks/experiments/test_e2e_mini.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Implement embedding generation using a provider pattern with three backends in priority order: OpenAI (if OPENAI_API_KEY set), FastEmbed (local ONNX model), and Placeholder (hash-based fallback)
Applied to files:
tests/test_api_endpoints.py
tests/test_embedding_providers.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to automem/embedding/*.py : Implement EmbeddingProvider abstraction with three implementations: OpenAIEmbeddingProvider, FastEmbedProvider, and PlaceholderEmbeddingProvider
Applied to files:
tests/test_embedding_providers.py
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Control embedding provider via EMBEDDING_PROVIDER environment variable with values: auto (default), openai, local, or placeholder
Applied to files:
tests/test_embedding_providers.py
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: Use the Agent Memory Protocol: fetch context with recall at task start, persist key steps with POST `/memory`, tag with conventions (`agent`, `repo:<name>`, `task:<id>`, `file:<path>`, `pr:<num>`, `issue:<num>`, `result`, `decision`, `docs`), and associate related memories.
Applied to files:
tests/benchmarks/experiments/README.md
📚 Learning: 2025-12-02T05:34:25.275Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-02T05:34:25.275Z
Learning: Core package is located at `automem/` with notable directories: `api/` (Flask blueprints), `utils/`, `stores/`, and `config.py`.
Applied to files:
tests/benchmarks/experiments/__init__.py
tests/benchmarks/core_adapter/package.json
📚 Learning: 2025-12-02T05:34:12.678Z
Learnt from: CR
Repo: verygoodplugins/automem PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-02T05:34:12.678Z
Learning: Applies to **/*enrichment*.py : Detect recurring patterns per memory type, strengthen shared Pattern nodes, and link memories via EXEMPLIFIES relationships with key terms
Applied to files:
tests/benchmarks/experiments/experiment_config.py
🧬 Code graph analysis (10)
tests/benchmarks/core_adapter/automem-search.js (1)
mcp-sse-server/server.js (1)
userId(333-333)
tests/benchmarks/experiments/test_e2e_mini.py (3)
tests/benchmarks/experiments/experiment_config.py (1)
ExperimentConfig (40-92)
tests/benchmarks/locomo_metrics.py (1)
f1_score (59-89)
tests/conftest.py (1)
OpenAI(128-131)
tests/benchmarks/experiments/run_optimization.sh (2)
tests/benchmarks/experiments/analysis_agent.py (1)
main (336-362)
tests/benchmarks/experiments/experiment_runner.py (1)
main(514-662)
tests/benchmarks/core_adapter/evaluate.js (2)
tests/benchmarks/core_adapter/automem-search.js (1)
AutoMemSearchService (15-110)
tests/conftest.py (1)
OpenAI(128-131)
tests/test_embedding_providers.py (1)
app.py (1)
init_embedding_provider(1088-1180)
tests/benchmarks/experiments/__init__.py (1)
tests/benchmarks/experiments/experiment_config.py (4)
ExperimentConfig (40-92)
ExperimentResult (96-152)
generate_experiment_grid (155-242)
generate_focused_experiments (245-260)
tests/benchmarks/experiments/test_minimal.py (2)
tests/benchmarks/experiments/experiment_config.py (1)
ExperimentConfig (40-92)
tests/benchmarks/test_locomo.py (1)
LoCoMoConfig(54-104)
tests/benchmarks/experiments/analysis_agent.py (2)
tests/benchmarks/experiments/experiment_config.py (5)
ExperimentConfig (40-92)
ExperimentResult (96-152)
score (126-152)
generate_experiment_grid (155-242)
to_json (91-92)
tests/benchmarks/experiments/experiment_runner.py (3)
ExperimentRunner (47-511)
run_grid_search (152-180)
main (514-662)
tests/benchmarks/locomo_metrics.py (1)
tests/benchmarks/test_locomo.py (1)
normalize_answer(538-571)
tests/benchmarks/experiments/railway_manager.py (1)
tests/benchmarks/experiments/experiment_config.py (2)
ExperimentConfig (40-92)
to_env_vars (70-89)
🪛 LanguageTool
tests/benchmarks/experiments/README.md
[grammar] ~160-~160: Ensure spelling is correct
Context: ...Monitor response times - stay under 100ms for production 4. Watch token costs...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
tests/benchmarks/COMPETITOR_ANALYSIS.md
[style] ~189-~189: To elevate your writing, try using an alternative expression here.
Context: ...ompetitive advantage. --- ## Why This Matters for LoCoMo Categories | Category | Gap...
(MATTERS_RELEVANT)
🪛 markdownlint-cli2 (0.18.1)
tests/benchmarks/core_adapter/README.md
8-8: Bare URL used
(MD034, no-bare-urls)
tests/benchmarks/experiments/README.md
8-8: Bare URL used
(MD034, no-bare-urls)
tests/benchmarks/COMPETITOR_ANALYSIS.md
32-32: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
162-162: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🪛 Ruff (0.14.7)
tests/benchmarks/experiments/test_e2e_mini.py
1-1: Shebang is present but file is not executable
(EXE001)
44-44: f-string without any placeholders
Remove extraneous f prefix
(F541)
77-77: Do not catch blind exception: Exception
(BLE001)
91-91: f-string without any placeholders
Remove extraneous f prefix
(F541)
100-100: Local variable category is assigned to but never used
Remove assignment to unused variable category
(F841)
106-106: Probable use of requests call without timeout
(S113)
137-137: Do not catch blind exception: Exception
(BLE001)
158-158: f-string without any placeholders
Remove extraneous f prefix
(F541)
159-159: Probable use of requests call without timeout
(S113)
161-161: f-string without any placeholders
Remove extraneous f prefix
(F541)
tests/benchmarks/experiments/test_minimal.py
1-1: Shebang is present but file is not executable
(EXE001)
34-34: f-string without any placeholders
Remove extraneous f prefix
(F541)
50-50: f-string without any placeholders
Remove extraneous f prefix
(F541)
58-58: f-string without any placeholders
Remove extraneous f prefix
(F541)
63-63: Prefer next(iter(evaluator.conversations.keys())) over single element slice
Replace with next(iter(evaluator.conversations.keys()))
(RUF015)
68-68: f-string without any placeholders
Remove extraneous f prefix
(F541)
73-73: f-string without any placeholders
Remove extraneous f prefix
(F541)
78-78: f-string without any placeholders
Remove extraneous f prefix
(F541)
87-87: f-string without any placeholders
Remove extraneous f prefix
(F541)
91-91: f-string without any placeholders
Remove extraneous f prefix
(F541)
125-125: Unpacked variable explanation is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
138-138: f-string without any placeholders
Remove extraneous f prefix
(F541)
tests/benchmarks/experiments/analysis_agent.py
1-1: Shebang is present but file is not executable
(EXE001)
141-141: Consider moving this statement to an else block
(TRY300)
143-143: Do not catch blind exception: Exception
(BLE001)
259-259: f-string without any placeholders
Remove extraneous f prefix
(F541)
262-262: f-string without any placeholders
Remove extraneous f prefix
(F541)
295-295: f-string without any placeholders
Remove extraneous f prefix
(F541)
359-359: Local variable best_config is assigned to but never used
Remove assignment to unused variable best_config
(F841)
tests/benchmarks/locomo_metrics.py
174-174: Unused function argument: question
(ARG001)
174-174: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
358-358: Loop control variable cat not used within loop body
Rename unused cat to _cat
(B007)
tests/benchmarks/experiments/railway_manager.py
55-55: Starting a process with a partial executable path
(S607)
60-60: Consider moving this statement to an else block
(TRY300)
61-61: Do not catch blind exception: Exception
(BLE001)
82-82: subprocess call: check for execution of untrusted input
(S603)
83-87: Starting a process with a partial executable path
(S607)
91-91: Probable insecure usage of temporary file or directory: "/tmp"
(S108)
104-104: subprocess call: check for execution of untrusted input
(S603)
105-105: Starting a process with a partial executable path
(S607)
112-112: Starting a process with a partial executable path
(S607)
141-141: Consider moving this statement to an else block
(TRY300)
143-143: Do not catch blind exception: Exception
(BLE001)
166-166: Probable insecure usage of temporary file or directory: "/tmp/"
(S108)
173-173: subprocess call: check for execution of untrusted input
(S603)
174-180: Starting a process with a partial executable path
(S607)
205-205: Consider moving this statement to an else block
(TRY300)
207-207: Do not catch blind exception: Exception
(BLE001)
216-216: subprocess call: check for execution of untrusted input
(S603)
217-217: Starting a process with a partial executable path
(S607)
224-224: Starting a process with a partial executable path
(S607)
239-239: Consider moving this statement to an else block
(TRY300)
241-241: Do not catch blind exception: Exception
(BLE001)
266-267: try-except-pass detected, consider logging the exception
(S110)
266-266: Do not catch blind exception: Exception
(BLE001)
271-271: Unused method argument: project_id
(ARG002)
274-274: Starting a process with a partial executable path
(S607)
325-325: Consider moving this statement to an else block
(TRY300)
326-326: Do not catch blind exception: Exception
(BLE001)
340-340: Consider moving this statement to an else block
(TRY300)
341-341: Do not catch blind exception: Exception
(BLE001)
tests/benchmarks/test_locomo.py
701-701: Unused method argument: question
(ARG002)
752-752: Local variable found_in_memory is assigned to but never used
Remove assignment to unused variable found_in_memory
(F841)
849-849: Consider moving this statement to an else block
(TRY300)
851-851: Do not catch blind exception: Exception
(BLE001)
935-935: Consider moving this statement to an else block
(TRY300)
937-937: Do not catch blind exception: Exception
(BLE001)
1083-1083: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
1084-1084: Unused method argument: category
(ARG002)
1438-1438: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
tests/benchmarks/experiments/experiment_runner.py
1-1: Shebang is present but file is not executable
(EXE001)
137-137: Do not catch blind exception: Exception
(BLE001)
288-288: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
289-289: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
339-339: Local variable worst is assigned to but never used
Remove assignment to unused variable worst
(F841)
382-382: f-string without any placeholders
Remove extraneous f prefix
(F541)
392-392: f-string without any placeholders
Remove extraneous f prefix
(F541)
399-399: f-string without any placeholders
Remove extraneous f prefix
(F541)
414-414: f-string without any placeholders
Remove extraneous f prefix
(F541)
425-425: f-string without any placeholders
Remove extraneous f prefix
(F541)
631-631: f-string without any placeholders
Remove extraneous f prefix
(F541)
649-649: Local variable results is assigned to but never used
Remove assignment to unused variable results
(F841)
🪛 Shellcheck (0.11.0)
tests/benchmarks/experiments/run_optimization.sh
[error] 143-143: -f doesn't work with globs. Use a for loop.
(SC2144)
🔇 Additional comments (18)
.gitignore (1)
25-26: LGTM! The ignore pattern correctly prevents committing temporary experiment results generated by the benchmark suite, and the comment clarifies the intended workflow. The pattern syntax is consistent with existing rules in the file (e.g., line 17).
tests/test_api_endpoints.py (2)
16-16: LGTM: Good use of configuration constant.

Using `config.VECTOR_SIZE` instead of a hardcoded value improves maintainability and ensures the test stays synchronized with the actual configuration.

Also applies to: 918-918
619-637: LGTM: Test correctly validates OpenAI requirement for reembed.

The updated test expectations properly reflect that the reembed operation requires OpenAI to be available. The 503 status code and error message validation ensure appropriate failure handling when OpenAI is unavailable.
tests/benchmarks/REPRODUCIBILITY.md (1)
1-216: Well-structured reproducibility documentation.

The guide covers environment setup, multiple benchmark modes, result verification, and troubleshooting comprehensively. The citation guidance at the end is a nice touch for academic/comparative use.
tests/benchmarks/experiments/README.md (1)
1-228: Comprehensive experiment framework documentation.

The README thoroughly documents grid search, AI-guided optimization, ablation studies, and Railway deployment. The ASCII diagrams and example JSON help illustrate the workflow clearly.
Note: References to "GPT-5.1" (lines 48, 110, 194) appear to be placeholders for future models - ensure these are updated when actual models are available or clarify they're hypothetical.
tests/benchmarks/experiments/__init__.py (1)
1-21: LGTM - Clean package initialization.

The `__init__.py` properly establishes the public API surface with an explicit `__all__` declaration matching the imports.

tests/benchmarks/COMPETITOR_ANALYSIS.md (1)
1-244: Excellent competitive analysis document.

The analysis clearly identifies the performance gap causes (fact extraction, temporal reasoning) and provides actionable implementation priorities. The phased approach (Quick Wins → Statement Extraction → Contradiction Resolution) is pragmatic.
tests/benchmarks/experiments/experiment_config.py (2)
254-260: Potential issue with dataclass field copying.

Using `base_config.__dict__.copy()` works for simple dataclasses but may not handle nested objects or `default_factory` fields correctly. The `errors` field in `ExperimentResult` uses `field(default_factory=list)`, so ensure this pattern isn't applied there.

This is acceptable for `ExperimentConfig` since all fields are simple types. Just ensure `generate_focused_experiments` is only used with `ExperimentConfig`, not `ExperimentResult`.
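To make the concern concrete, here is a small standalone sketch (with a hypothetical `Result` dataclass, not the PR's classes) of why re-instantiating from `__dict__.copy()` becomes risky once a `default_factory` field is involved:

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Result:
    name: str
    errors: list = field(default_factory=list)


base = Result(name="baseline")

# Re-instantiating from __dict__.copy() is only a shallow copy:
# the clone shares the *same* errors list as the original.
clone = Result(**base.__dict__.copy())
clone.errors.append("oops")
assert base.errors == ["oops"]  # the mutation leaked back into `base`

# copy.deepcopy (or explicitly passing a fresh list) keeps the instances independent.
safe = copy.deepcopy(base)
safe.errors.append("independent")
assert base.errors == ["oops"]  # unchanged by `safe`
```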
39-93: LGTM - Well-structured configuration dataclass.

The `ExperimentConfig` class is well-designed with sensible defaults, and the `to_env_vars()` method provides clean Railway integration.

tests/benchmarks/core_adapter/evaluate.js (1)
37-38: Model name "gpt-5.1" is valid and correctly used.

The default model configuration correctly references "gpt-5.1", which OpenAI released on November 12, 2025. No changes are needed.
tests/benchmarks/core_adapter/automem-search.js (2)
1-113: LGTM overall - clean adapter implementation.

The adapter correctly implements the CORE-compatible interface with appropriate timeout, auth headers, and format conversion. The structure is well-documented.
76-85: Defensive check for `r.memory.content` before accessing.

Line 82 accesses `r.memory.content` without verifying `r.memory` exists (unlike the filter at line 78, which safely uses optional chaining). If `r.memory` is undefined, this would throw.

```diff
  .filter((r) => {
    const tags = r.memory?.tags || [];
    return tags.some((t) => t.startsWith("entity:"));
  })
  .map((r) => ({
-   fact: r.memory.content,
-   validAt: r.memory.metadata?.session_datetime || null,
-   source: r.memory.id,
+   fact: r.memory?.content || "",
+   validAt: r.memory?.metadata?.session_datetime || null,
+   source: r.memory?.id,
  }));
```

Likely an incorrect or invalid review comment.
tests/benchmarks/locomo_metrics.py (1)
59-89: LGTM - F1 implementation matches official LoCoMo.

The token-level F1 with Porter stemming correctly implements the official metric from the LoCoMo paper.
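For readers unfamiliar with the metric, here is a compact sketch of token-level F1 with Porter stemming. It illustrates the general approach only and is not a copy of `locomo_metrics.py`:

```python
from collections import Counter

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def _stemmed_tokens(text: str) -> list[str]:
    # Lowercase, whitespace-tokenize, and stem each token.
    return [stemmer.stem(tok) for tok in text.lower().split()]


def token_f1(prediction: str, ground_truth: str) -> float:
    pred, gold = _stemmed_tokens(prediction), _stemmed_tokens(ground_truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(round(token_f1("went hiking with Sam", "hiked with Sam"), 3))  # ~0.857
```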
tests/benchmarks/experiments/railway_manager.py (2)
20-29: LGTM - RailwayInstance dataclass is well-structured.

Clean dataclass definition with appropriate fields for tracking deployment state.
82-108: No subprocess security issue here: arguments are already properly passed as a list.

The `subprocess.run()` call uses the list form without `shell=True`, so arguments bypass shell parsing entirely. Environment variables from `config.to_env_vars()` are safe even if they contain special characters, since `["railway", "variables", "set", f"{key}={value}"]` passes the concatenated string directly to the Railway CLI without shell interpretation.

tests/benchmarks/test_locomo.py (1)
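A tiny generic illustration of the distinction being drawn here (not the repository's code): list-form `subprocess.run` delivers each element directly as an argv entry, while `shell=True` hands the string to a shell that interprets metacharacters.

```python
import subprocess

value = "x; echo injected"  # contains a shell metacharacter

# List form: no shell involved; the semicolon is plain data inside one argument.
subprocess.run(["echo", f"KEY={value}"], check=True)
# prints: KEY=x; echo injected

# shell=True: /bin/sh parses the same text, so the semicolon splits it
# into two commands and "injected" is echoed separately.
subprocess.run(f"echo KEY={value}", shell=True, check=True)
# prints: KEY=x
#         injected
```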
1656-1676: Same invalid model names in CLI defaults.

Lines 1659 and 1675 reference `gpt-5.1`, which doesn't exist.

```diff
 parser.add_argument(
     "--e2e-model",
-    default="gpt-5.1",
-    help="Model for E2E answer generation (default: gpt-5.1, options: gpt-4.1, gpt-4o-mini)",
+    default="gpt-4o",
+    help="Model for E2E answer generation (default: gpt-4o, options: gpt-4-turbo, gpt-4o-mini)",
 )
 # ...
 parser.add_argument(
     "--eval-judge-model",
-    default="gpt-5.1",
-    help="Model for lenient evaluation judging (default: gpt-5.1)",
+    default="gpt-4o",
+    help="Model for lenient evaluation judging (default: gpt-4o)",
 )
```

⛔ Skipped due to learnings
Learnt from: jack-arturo
Repo: verygoodplugins/automem PR: 19
File: tests/benchmarks/experiments/analysis_agent.py:69-69
Timestamp: 2025-12-08T19:21:25.845Z
Learning: OpenAI released GPT-5.1 on November 12, 2025. The model identifier "gpt-5.1" is valid for API usage in the automem codebase.

tests/benchmarks/experiments/analysis_agent.py (1)
69-69: Invalid model name "gpt-5.1".

Same issue as in test_locomo.py: `gpt-5.1` doesn't exist.

```diff
 def __init__(
     self,
-    model: str = "gpt-5.1",
+    model: str = "gpt-4o",
     results_dir: str = "experiment_results",
```

⛔ Skipped due to learnings
Learnt from: jack-arturo
Repo: verygoodplugins/automem PR: 19
File: tests/benchmarks/experiments/analysis_agent.py:69-69
Timestamp: 2025-12-08T19:21:25.845Z
Learning: OpenAI released GPT-5.1 on November 12, 2025. The model identifier "gpt-5.1" is valid for API usage in the automem codebase.

tests/benchmarks/experiments/experiment_runner.py (1)
115-115: Use safe dictionary access to prevent KeyError.

If a conversation in the data file doesn't have a `sample_id` key, this will raise a KeyError.

```diff
- conv_ids = [c.get('sample_id') for c in convs[:self.num_conversations]]
+ conv_ids = [c.get('sample_id') for c in convs[:self.num_conversations] if 'sample_id' in c]
```

Likely an incorrect or invalid review comment.
requirements-dev.txt (lines 11-12):

```text
# Benchmark evaluation (LoCoMo official metrics)
nltk==3.9.1
```
🧩 Analysis chain
🌐 Web query:
nltk 3.9.1 PyPI latest version
💡 Result:
No — the latest NLTK on PyPI is 3.9.2 (uploaded Oct 1, 2025). Version 3.9.1 was uploaded Aug 18, 2024. [1][2]
Sources:
[1] PyPI nltk project page (release files showing 3.9.2).
[2] PyPI mirror listing for nltk-3.9.1 (upload date Aug 18, 2024).
🌐 Web query:
NLTK Porter stemmer F1 scoring benchmark metrics
💡 Result:
Short answer: there is no single authoritative "NLTK Porter stemmer F1" value — F1 for Porter depends on the dataset and the evaluation method (what counts as a correct stem, downstream task labels, etc.). To measure it you typically (1) produce stemmed tokens for test and gold lists, (2) treat matching stems as positives and compute precision/recall/F1 (or measure downstream task performance with/without stemming). Use nltk.metrics.f_measure or scikit-learn's classification_report to compute F1. Examples and utilities: NLTK metrics docs (precision/recall/f_measure). [1] Comparative evaluations showing differing F1 results for Porter vs other stemmers depend on dataset/model (examples in published experiments). [2]
References
[1] NLTK — metrics how-to (precision/recall/f_measure usage).
[2] Example comparative evaluations reporting Porter F1s in task experiments (shows scores vary by dataset and model).
Update nltk to the latest version 3.9.2.
The pinned version nltk==3.9.1 (Aug 2024) is outdated. Version 3.9.2 was released on October 1, 2025 and is available. Update to nltk==3.9.2 for the latest features and patches.
🤖 Prompt for AI Agents
In requirements-dev.txt around lines 11 to 12, the file pins nltk to an older
patch release (nltk==3.9.1); update the requirement to nltk==3.9.2 by changing
the version string to the newer patch release so dependency installs use 3.9.2;
run pip-compile or your dependency update workflow and run a quick test suite or
import nltk locally to verify compatibility.
tests/benchmarks/core_adapter/package.json (context):

```json
"version": "1.0.0",
"description": "Adapter to run CORE benchmark against AutoMem",
"type": "module",
"main": "index.js",
```
🧩 Analysis chain
🏁 Script executed:
find tests/benchmarks/core_adapter -type f -name "*.js" | sort

Repository: verygoodplugins/automem
Length of output: 156
🏁 Script executed:
fd --type f --full-path 'tests/benchmarks/core_adapter/index\.js$'

Repository: verygoodplugins/automem
Length of output: 49
Update the main entry point.
The main field references index.js, which does not exist in this directory. The only available JavaScript files are automem-search.js and evaluate.js. Update the main field to point to the appropriate entry point, or add an index.js file that exports the necessary modules.
🤖 Prompt for AI Agents
In tests/benchmarks/core_adapter/package.json around line 6, the "main" field
currently points to "index.js" which does not exist; update the "main" entry to
reference an existing JS entry (for example "automem-search.js" or
"evaluate.js") that should serve as the package entry point, or alternatively
add a new index.js that re-exports the appropriate module(s); make the change so
package.json's "main" matches an actual file and ensure any exported API matches
what consumers expect.
tests/benchmarks/experiments/analysis_agent.py (context):

```python
parser = argparse.ArgumentParser(description="AutoMem Autonomous Optimization Agent")
parser.add_argument("--iterations", type=int, default=10, help="Max iterations")
parser.add_argument("--model", type=str, default="gpt-5.1", help="LLM model for analysis")
```
Same invalid model in CLI default.
```diff
- parser.add_argument("--model", type=str, default="gpt-5.1", help="LLM model for analysis")
+ parser.add_argument("--model", type=str, default="gpt-4o", help="LLM model for analysis")
```
🤖 Prompt for AI Agents
In tests/benchmarks/experiments/analysis_agent.py at line 342, the CLI default
model "gpt-5.1" is invalid; update the parser default to a supported model name
(for example "gpt-4" or your project's canonical default) or read the default
from the project config/ENV so it stays valid; change the argument to use that
supported model string and ensure any tests or docs referencing the old default
are updated accordingly.
tests/benchmarks/experiments/railway_manager.py (context):

```python
async def destroy_instance(self, instance: RailwayInstance) -> bool:
    """Tear down a Railway instance"""
    try:
        if instance.environment_id == "local":
            # Docker cleanup
            subprocess.run(
                ["docker", "compose", "-p", instance.project_id, "down", "-v"],
                capture_output=True,
                timeout=60
            )
        else:
            # Railway cleanup
            subprocess.run(
                ["railway", "delete", "--yes"],
                capture_output=True,
                timeout=60
            )

        # Find and remove from active_instances (keyed by instance_name, not project_id)
        key_to_delete = None
        for key, inst in self.active_instances.items():
            if inst.project_id == instance.project_id:
                key_to_delete = key
                break
        if key_to_delete:
            del self.active_instances[key_to_delete]

        print(f"🗑️ Destroyed instance: {instance.project_id}")
        return True

    except Exception as e:
        print(f"❌ Cleanup error: {e}")
        return False
```
Railway cleanup doesn't specify project context.
The railway delete command at line 224 doesn't specify which project to delete. This could delete the wrong project if the CLI context changed.
```diff
 else:
     # Railway cleanup
     subprocess.run(
-        ["railway", "delete", "--yes"],
+        ["railway", "delete", "--yes", "-p", instance.project_id],
         capture_output=True,
         timeout=60
     )
```
🧰 Tools
🪛 Ruff (0.14.7)
216-216: subprocess call: check for execution of untrusted input
(S603)
217-217: Starting a process with a partial executable path
(S607)
224-224: Starting a process with a partial executable path
(S607)
239-239: Consider moving this statement to an else block
(TRY300)
241-241: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In tests/benchmarks/experiments/railway_manager.py around lines 211 to 243, the
call to `railway delete` does not specify which project to delete, risking
deleting the wrong project; update the subprocess call to pass the target
project explicitly (e.g., add the appropriate project flag/argument using
instance.project_id such as "--project" or equivalent supported by the Railway
CLI) so the command runs like: ["railway", "delete", "--yes", "--project",
instance.project_id] (or the CLI's correct project option), keep capture_output
and timeout, and ensure the rest of the cleanup code still removes the instance
from active_instances and logs the result.
```python
async def _get_deployment_url(self, project_id: str) -> str:
    """Get the public URL for a Railway deployment"""
    result = subprocess.run(
        ["railway", "domain"],
        capture_output=True,
        text=True
    )
    # Parse URL from output
    for line in result.stdout.split("\n"):
        if "railway.app" in line or "up.railway.app" in line:
            return f"https://{line.strip()}"
    return ""
```
Unused project_id parameter in _get_deployment_url.
The method signature accepts project_id but doesn't use it. The railway domain command should probably be scoped to the project.
Either use the parameter or remove it:
```diff
- async def _get_deployment_url(self, project_id: str) -> str:
+ async def _get_deployment_url(self, _project_id: str) -> str:
      """Get the public URL for a Railway deployment"""
```

Or scope the command:

```diff
  result = subprocess.run(
-     ["railway", "domain"],
+     ["railway", "domain", "-p", project_id],
      capture_output=True,
      text=True
  )
```
🪛 Ruff (0.14.7)
271-271: Unused method argument: project_id
(ARG002)
274-274: Starting a process with a partial executable path
(S607)
🤖 Prompt for AI Agents
In tests/benchmarks/experiments/railway_manager.py around lines 271-282, the
_get_deployment_url function declares a project_id parameter but never uses it;
update the subprocess.run invocation to scope the railway command to the
provided project (for example add the project flag like ["railway", "domain",
"--project", project_id]) so the domain lookup targets the correct project,
handle the case where project_id is empty or None by raising/returning early,
and keep the existing stdout parsing and return behavior unchanged.
tests/benchmarks/experiments/run_optimization.sh (context):

```bash
if [ -f "$OUTPUT_DIR/"*"_report.txt" ]; then
    echo ""
    echo -e "${BLUE}=== SUMMARY ===${NC}"
    cat "$OUTPUT_DIR/"*"_report.txt" | head -30
fi
```
Fix the file existence check with glob pattern.
The -f test doesn't work correctly with glob patterns. Use a for loop or other approach to check for matching files.
Apply this diff to properly check for report files:
```diff
 # Show summary
-if [ -f "$OUTPUT_DIR/"*"_report.txt" ]; then
+report_file=$(find "$OUTPUT_DIR" -name "*_report.txt" -type f | head -1)
+if [ -n "$report_file" ]; then
     echo ""
     echo -e "${BLUE}=== SUMMARY ===${NC}"
-    cat "$OUTPUT_DIR/"*"_report.txt" | head -30
+    head -30 "$report_file"
 fi
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Shellcheck (0.11.0)
[error] 143-143: -f doesn't work with globs. Use a for loop.
(SC2144)
🤖 Prompt for AI Agents
In tests/benchmarks/experiments/run_optimization.sh around lines 143-147, the
current if [ -f "$OUTPUT_DIR/"*"_report.txt" ] check incorrectly tests a glob
pattern; replace it with logic that expands the glob and verifies an actual file
exists (e.g., iterate over "$OUTPUT_DIR/"*"_report.txt" and on the first
existing file run the summary block, or use a glob-existence test such as
compgen -G to detect matches), then cat the matched file(s) as before; ensure
the loop or test short-circuits after finding no matches to preserve original
behavior.
tests/benchmarks/experiments/test_e2e_mini.py (context):

```python
resp = requests.get(
    f"{BASE_URL}/recall",
    headers=headers,
    params={"query": question, "tags": test_tag, "limit": 5}
)
```
Add timeout to requests.get call.
The recall request is missing a timeout, which could cause the test to hang indefinitely.
```diff
 resp = requests.get(
     f"{BASE_URL}/recall",
     headers=headers,
-    params={"query": question, "tags": test_tag, "limit": 5}
+    params={"query": question, "tags": test_tag, "limit": 5},
+    timeout=10
 )
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- resp = requests.get(
-     f"{BASE_URL}/recall",
-     headers=headers,
-     params={"query": question, "tags": test_tag, "limit": 5}
- )
+ resp = requests.get(
+     f"{BASE_URL}/recall",
+     headers=headers,
+     params={"query": question, "tags": test_tag, "limit": 5},
+     timeout=10
+ )
🧰 Tools
🪛 Ruff (0.14.7)
106-106: Probable use of requests call without timeout
(S113)
🤖 Prompt for AI Agents
In tests/benchmarks/experiments/test_e2e_mini.py around lines 106 to 110, the
requests.get call for the /recall endpoint lacks a timeout which can cause the
test to hang; add a timeout parameter (e.g. timeout=10) to the requests.get call
(or use a project-configured constant) so the request fails fast on network
issues, and adjust any test expectations or exception handling if necessary to
accommodate timeout exceptions.
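Beyond patching this one call, one way to keep timeouts from being forgotten again is to route the benchmark's HTTP calls through a small helper with a default timeout. A sketch under that assumption; the helper name and the 10-second default are illustrative, not part of the existing test:

```python
import requests

DEFAULT_TIMEOUT = 10  # seconds; illustrative default, tune per environment


def recall(base_url: str, headers: dict, question: str, test_tag: str, limit: int = 5) -> dict:
    """Query /recall with an explicit timeout so a stalled server fails fast."""
    resp = requests.get(
        f"{base_url}/recall",
        headers=headers,
        params={"query": question, "tags": test_tag, "limit": limit},
        timeout=DEFAULT_TIMEOUT,
    )
    resp.raise_for_status()
    return resp.json()
```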
resp = requests.delete(
    f"{BASE_URL}/admin/memories",
    headers={"Authorization": f"Bearer admin"},
    params={"tag": test_tag, "confirm": "yes"}
)
print(f" Cleanup: {resp.status_code}")
Add timeout to cleanup request and verify auth token usage.
The delete request is missing a timeout. Also, the cleanup uses a hardcoded "admin" token instead of API_TOKEN; verify this is intentional for the admin endpoint.
  resp = requests.delete(
      f"{BASE_URL}/admin/memories",
      headers={"Authorization": f"Bearer admin"},
-     params={"tag": test_tag, "confirm": "yes"}
+     params={"tag": test_tag, "confirm": "yes"},
+     timeout=10
  )
🧰 Tools
🪛 Ruff (0.14.7)
159-159: Probable use of requests call without timeout
(S113)
161-161: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
In tests/benchmarks/experiments/test_e2e_mini.py around lines 159 to 164, the
cleanup DELETE call is missing a timeout and uses a hardcoded "admin" token;
update the requests.delete invocation to include a timeout (e.g., timeout=10)
and replace the hardcoded Authorization value with the shared API_TOKEN variable
(headers={"Authorization": f"Bearer {API_TOKEN}"}) unless the test truly
requires a separate admin token—if a dedicated admin token is required, document
that intent and reference a named ADMIN_TOKEN constant instead of the literal
"admin".
# Category 3 (multi-hop): Use first answer only (before semicolon)
if category == 3:
    ground_truth = ground_truth.split(";")[0].strip()

# Category 1 (single-hop with multiple sub-answers): Use multi-answer F1
if category == 1:
    score = f1_multi_answer(prediction, ground_truth)
    return score, "f1_multi_answer"
🧩 Analysis chain
🏁 Script executed:
cd tests/benchmarks && cat -n locomo_metrics.py | head -200
Repository: verygoodplugins/automem
Length of output: 7344
🏁 Script executed:
# Search for the function containing this code and its docstring
rg -n "def.*:" tests/benchmarks/locomo_metrics.py -A 50 | head -150Repository: verygoodplugins/automem
Length of output: 5148
🏁 Script executed:
# Look for category definitions or documentation
rg -n "Category|category" tests/benchmarks/locomo_metrics.py -B 2 -A 2Repository: verygoodplugins/automem
Length of output: 4474
🌐 Web query:
LoCoMo paper ACL 2024 category definitions single-hop multi-hop
💡 Result:
Single-hop: question answerable using information contained in a single session. [1]
Multi-hop: question requiring synthesis of information from multiple different sessions. [1]
(These definitions are given in LoCoMo Section 4.1; see the LoCoMo repo/paper release and accompanying description.) [1][2]
Sources:
[1] snap-research/locomo (ACL 2024 LoCoMo repo & paper).
[2] MemMachine summary of LoCoMo category descriptions.
🏁 Script executed:
# Check if there are any tests or other files that clarify category definitions
rg -n "category.*1|category.*3" tests/ --type py -i | grep -E "(single|multi|hop|recall)" | head -20Repository: verygoodplugins/automem
Length of output: 1347
🏁 Script executed:
# Look for any documentation or README files that explain the categories
find . -name "README*" -o -name "*.md" | xargs grep -l "category\|locomo" 2>/dev/null | head -5Repository: verygoodplugins/automem
Length of output: 166
Fix incorrect docstring for f1_multi_answer function.
The docstring at line 94 incorrectly describes f1_multi_answer as handling "multi-hop questions (category 1)". Category 1 is actually single-hop recall with multiple sub-answers (comma-separated), while Category 3 is multi-hop (semicolon-separated). The docstring should read: "Official LoCoMo F1 for single-hop questions with multiple sub-answers (category 1)."
🤖 Prompt for AI Agents
In tests/benchmarks/locomo_metrics.py around line 94, the docstring for the
f1_multi_answer function incorrectly says it handles "multi-hop questions
(category 1)"; update the docstring to read: "Official LoCoMo F1 for single-hop
questions with multiple sub-answers (category 1)." Replace the erroneous phrase
so the docstring correctly describes category 1 as single-hop recall with
comma-separated sub-answers.
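To make the category semantics concrete, here is a self-contained sketch of the dispatch implied by the snippet quoted above. The token-level F1 and the way sub-answer scores are aggregated are simplified stand-ins chosen only to keep the example runnable; the official normalization and aggregation in locomo_metrics.py remain the source of truth.

```python
def _token_f1(prediction: str, gold: str) -> float:
    """Plain token-overlap F1 between one prediction and one gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = sum(min(pred_tokens.count(t), gold_tokens.count(t)) for t in set(gold_tokens))
    if not pred_tokens or not gold_tokens or common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score_by_category(prediction: str, ground_truth: str, category: int) -> float:
    """Illustrative dispatch mirroring the quoted snippet.

    Category 1 gold answers hold several comma-separated sub-answers
    (single-hop recall); category 3 gold answers chain hops with ";" and
    only the first hop is scored.
    """
    if category == 3:
        ground_truth = ground_truth.split(";")[0].strip()

    if category == 1:
        subs = [s.strip() for s in ground_truth.split(",") if s.strip()]
        # Averaging sub-answer scores is a simplification; use the official
        # f1_multi_answer aggregation from locomo_metrics.py in practice.
        return sum(_token_f1(prediction, s) for s in subs) / max(len(subs), 1)

    return _token_f1(prediction, ground_truth)
```

Called as score_by_category(pred, gold, category), it keeps only the first semicolon-separated hop for category 3 and scores each comma-separated sub-answer for category 1, which is exactly the distinction the corrected docstring should reflect.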
# Clone repository
git clone https://github.com/your-org/automem.git
cd automem
Update placeholder repository URL.
The git clone URL uses a placeholder that should be updated to the actual repository URL.
  # Clone repository
- git clone https://github.com/your-org/automem.git
+ git clone https://github.com/verygoodplugins/automem.git
  cd automem
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- # Clone repository
- git clone https://github.com/your-org/automem.git
- cd automem
+ # Clone repository
+ git clone https://github.com/verygoodplugins/automem.git
+ cd automem
🤖 Prompt for AI Agents
In tests/benchmarks/REPRODUCIBILITY.md around lines 20–22 the git clone command
uses a placeholder URL (https://github.com/your-org/automem.git); replace it
with the actual repository clone URL for this project (e.g. the real GitHub
org/user and repo name or the SSH clone URL) so the command works for
contributors, and verify the URL is reachable and correct before committing.
- Add comprehensive LoCoMo test harness with multiple evaluation modes
- Add CORE adapter for direct comparison with CORE benchmark
- Add experiment runner framework for parameter optimization
- Add analysis agent for automated result interpretation
- Add benchmark documentation (analysis, competitor comparison, reproducibility)
- Add convenience scripts for running benchmarks

This provides a complete benchmarking infrastructure for validating AutoMem's performance against academic baselines.
bf2fa16 to a7fd3f1
This pull request introduces comprehensive benchmarking infrastructure and documentation to enable fair, apples-to-apples evaluation of AutoMem against the CORE memory system on the LoCoMo benchmark. It adds an official reproducibility guide and competitor analysis, and implements a Node.js adapter for running CORE's evaluation methodology directly against AutoMem. The changes clarify the true performance gap, highlight the impact of evaluation methodology, and provide actionable recommendations for closing the gap with CORE.
Benchmarking and Evaluation Infrastructure:
- A reproducibility guide (REPRODUCIBILITY.md) with step-by-step instructions for running all benchmark modes, comparing results, and troubleshooting. This ensures anyone can replicate the reported numbers.
- A benchmark analysis (BENCHMARK_ANALYSIS_2025-12-03.md) that compares AutoMem and CORE across evaluation modes, explains discrepancies, and provides category breakdowns and recommendations for reporting results.
- A competitor analysis (COMPETITOR_ANALYSIS.md) outlining the architectural differences between AutoMem, CORE, mem0, and Zep/Graphiti, and providing a prioritized roadmap to close the performance gap (e.g., statement extraction, temporal reasoning, contradiction detection).

CORE-Compatible Evaluation Adapter:
- A Node.js search adapter (core_adapter/automem-search.js) that mimics CORE's search interface, enabling direct use of CORE's evaluation scripts with AutoMem as the backend. This allows for direct, fair benchmarking using CORE's own code and methodology.
- A usage guide (core_adapter/README.md) describing how to use the adapter, run the evaluation, and verify results against CORE's published numbers.

Dependencies:
- Adds nltk==3.9.1 to requirements-dev.txt for official F1 metric computation in benchmark scripts.

These changes provide a robust foundation for transparent, reproducible benchmarking and set the stage for targeted improvements to AutoMem's retrieval and reasoning capabilities.