Contributor

@skoob13 skoob13 commented Oct 15, 2025

Problem

This is the second attempt at the new loop executor, with the issues from the first attempt fixed.

Changes

  • Fixed the Inkeep node.
  • Fixed the HogQL fixer.
  • Fixed the insight search streaming.

How did you test this code?

Unit tests & manual testing

Changelog: (features only) Is this feature complete?

Max has become a more intelligent agent.

Contributor

github-actions bot commented Oct 15, 2025

Size Change: +28 B (0%)

Total Size: 3.06 MB

ℹ️ View Unchanged
Filename | Size | Change
frontend/dist/toolbar.js | 3.06 MB | +28 B (0%)

compressed-size-action

@skoob13 skoob13 force-pushed the feat/loop-executor-revamp-attempt-2 branch from bb11ccc to 24f8a3e Compare October 16, 2025 14:24
@skoob13 skoob13 requested a review from a team October 16, 2025 14:26
@skoob13 skoob13 marked this pull request as ready for review October 16, 2025 14:26
@posthog-bot posthog-bot requested review from a team, ablaszkiewicz, daibhin, hpouillot, ioannisj, marandaneto, oliverb123 and rafaeelaudibert and removed request for a team October 16, 2025 14:28
Contributor

@greptile-apps greptile-apps bot left a comment


82 files reviewed, 1 comment

Contributor

@kappa90 kappa90 left a comment


:yolo:

Member

@Twixes Twixes left a comment


Fixed one more case caused by ReasoningMessages now being kept after streaming completes: if generation is canceled while the last message is an ai/reasoning one, that message should say "Canceled" instead of appearing to load endlessly.
Everything is working here.
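
For readers who want the cancellation rule spelled out, here is a minimal sketch of the behavior described above. It is not the code from this PR: the `Message` shape, the `status` field, and the `finalize_on_cancel` helper are hypothetical names chosen for illustration, and only the "ai/reasoning" type and the "Canceled" label come from the comment.

```python
from dataclasses import dataclass, replace
from typing import Literal


# Hypothetical message shape for illustration; not the project's real schema.
@dataclass(frozen=True)
class Message:
    type: Literal["human", "ai", "ai/reasoning"]
    content: str
    status: Literal["loading", "completed", "canceled"] = "completed"


def finalize_on_cancel(messages: list[Message]) -> list[Message]:
    """Return a copy of the thread in which a trailing, still-loading
    reasoning message is marked as canceled."""
    if not messages:
        return []
    last = messages[-1]
    if last.type == "ai/reasoning" and last.status == "loading":
        # Swap the spinner state for an explicit "Canceled" label.
        canceled = replace(last, content="Canceled", status="canceled")
        return [*messages[:-1], canceled]
    # Any other trailing message is left untouched.
    return list(messages)
```

Returning a new list rather than mutating the input keeps the sketch easy to test; whether the real fix lives in the frontend or the backend is not stated in this thread.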

# Original node has Anthropic messages, but Inkeep expects OpenAI messages
langchain_messages = convert_to_messages(
convert_to_openai_messages(super()._construct_messages(messages, window_start_id, tool_calls_count))
conversation_window = self._window_manager.get_messages_in_window(messages, window_start_id)[-28:]
Member


Worth adding a comment explaining why the number is 28 for other readers, while we're here.

Contributor Author


Thanks, updated
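
One way to act on this kind of review note is to replace the magic number with a named, documented constant. The sketch below assumes nothing about the actual update: `MAX_WINDOW_MESSAGES` and `tail_window` are hypothetical names, and the reason for choosing 28 is not given in this thread, so the comment only marks where that rationale would go.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")

# Hypothetical constant name. The thread does not state why 28 was chosen,
# so the rationale belongs in this comment once it is known.
MAX_WINDOW_MESSAGES = 28


def tail_window(messages: Sequence[T], limit: int = MAX_WINDOW_MESSAGES) -> list[T]:
    """Keep at most `limit` of the most recent messages, preserving order."""
    if limit <= 0:
        # Guard needed because messages[-0:] would return the whole sequence.
        return []
    return list(messages[-limit:])
```

With this in place, the slice in the diff above could read `tail_window(conversation)` for some already-fetched `conversation` list, which keeps the limit and its explanation in one spot.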

@posthog-bot
Contributor

📸 UI snapshots have been updated

2 snapshot changes in total. 0 added, 2 modified, 0 deleted:

  • chromium: 0 added, 2 modified, 0 deleted (wasn't pushed!)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot
Contributor

📸 UI snapshots have been updated

2 snapshot changes in total. 0 added, 2 modified, 0 deleted:

  • chromium: 0 added, 2 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@skoob13 skoob13 added the evals-ready Whether to run AI evals on this PR. label Oct 16, 2025
@posthog-bot
Contributor

🧠 AI eval results

Evaluated 29 experiments, comprising 61 metrics.

deep_research_onboarding

🔵 covers_essential_topics: 100.00%, ±0.00% (improvements: 0, regressions: 0)
🔵 has_correct_sections: 100.00%, ±0.00% (improvements: 0, regressions: 0)
🔴 is_concise_and_focused: 33.33%, -26.67% (improvements: 0, regressions: 1)

Baseline: master-1760615533 • Avg. case performance: ⏱️ 48.91 s, 🔢 0 tokens

tool_routing_dashboard_creation

🔴 ToolRelevance: 59.10%, -17.00% (improvements: 1, regressions: 5)

Baseline: master-1760615555 • Avg. case performance: ⏱️ 8.30 s, 🔢 0 tokens

tool_call_dashboard_creation

🔵 dashboard_creation_accuracy: 40.00%, ±0.00% (improvements: 1, regressions: 1)

Baseline: master-1760615566 • Avg. case performance: ⏱️ 22.30 s, 🔢 6523 tokens, 💵 $0.0030 in tokens

funnel

🔴 plan_correctness: 82.75%, -7.00% (improvements: 3, regressions: 4)
🟢 QueryKindSelection: 100.00%, +5.26% (improvements: 1, regressions: 0)
🔴 query_and_plan_alignment: 91.11%, -1.52% (improvements: 4, regressions: 7)
🔵 time_range_relevancy: 94.44%, -0.29% (improvements: 0, regressions: 1)

Baseline: master-1760615620 • Avg. case performance: ⏱️ 62.20 s, 🔢 8709 tokens, 💵 $0.0185 in tokens

insight_evaluation_accuracy

🟢 InsightEvaluationAccuracy: 75.00%, +25.00% (improvements: 1, regressions: 0)

Baseline: master-1760615758 • Avg. case performance: ⏱️ 16.71 s, 🔢 2804 tokens, 💵 $0.0013 in tokens

memory

🔴 ToolRelevance: 68.87%, -1.07% (improvements: 2, regressions: 2)
🟢 memory_content_relevance: 74.00%, +4.00% (improvements: 3, regressions: 2)

Baseline: master-1760615771 • Avg. case performance: ⏱️ 2.43 s, 🔢 1134 tokens, 💵 $0.0032 in tokens

memory_onboarding

🔴 has_correct_style: 50.00%, -16.67% (improvements: 0, regressions: 1)
🔵 has_technical_details: 83.33%, ±0.00% (improvements: 0, regressions: 0)
🔵 satisfies_business_details: 83.33%, ±0.00% (improvements: 0, regressions: 0)
🔵 satisfies_product_details: 83.33%, ±0.00% (improvements: 0, regressions: 0)

Baseline: master-1760615784 • Avg. case performance: ⏱️ 116.52 s, 🔢 674 tokens, 💵 $0.0033 in tokens

retention

🔵 QueryKindSelection: 75.00%, ±0.00% (improvements: 0, regressions: 0)
🟢 plan_correctness: 63.00%, +63.00% (improvements: 4, regressions: 0)
🔵 query_and_plan_alignment: 92.50%, ±0.00% (improvements: 0, regressions: 0)
🔵 time_range_relevancy: 100.00%, ±0.00% (improvements: 0, regressions: 0)

Baseline: master-1760615956 • Avg. case performance: ⏱️ 62.02 s, 🔢 7693 tokens, 💵 $0.0207 in tokens

root

🔴 ToolRelevance: 69.14%, -18.11% (improvements: 4, regressions: 36)

Baseline: master-1760616013 • Avg. case performance: ⏱️ 11.23 s, 🔢 0 tokens

root_style

🔴 style_checker: 85.00%, -5.00% (improvements: 1, regressions: 2)

Baseline: master-1760616034 • Avg. case performance: ⏱️ 18.53 s, 🔢 0 tokens

tool_routing_session_replay

🔵 ToolRelevance: 94.86%, +0.14% (improvements: 4, regressions: 3)

Baseline: master-1760616083 • Avg. case performance: ⏱️ 11.32 s, 🔢 7668 tokens, 💵 $0.0469 in tokens

session_summarization_no_context

🔵 ToolRelevance: 97.99%, -0.35% (improvements: 1, regressions: 1)

Baseline: master-1760616096 • Avg. case performance: ⏱️ 8.74 s, 🔢 0 tokens

sql

🔴 QueryKindSelection: 92.31%, -7.69% (improvements: 0, regressions: 0)
🔵 plan_correctness: 76.07%, +0.36% (improvements: 5, regressions: 2)
🔴 query_and_plan_alignment: 84.23%, -7.69% (improvements: 2, regressions: 4)
🟢 retry_efficiency: 92.86%, +7.14% (improvements: 3, regressions: 1)
🟢 sql_syntax_correctness: 95.83%, +3.53% (improvements: 1, regressions: 1)
🔴 time_range_relevancy: 93.46%, -4.62% (improvements: 0, regressions: 1)

Baseline: master-1760616102 • Avg. case performance: ⏱️ 56.39 s, 🔢 10467 tokens, 💵 $0.0237 in tokens

survey_analysis

🔵 recommendation_quality: 100.00%, ±0.00% (improvements: 0, regressions: 0)
🔵 test_data_detection: 100.00%, ±0.00% (improvements: 0, regressions: 0)
🔵 theme_extraction_quality: 100.00%, ±0.00% (improvements: 0, regressions: 0)

Baseline: master-1760616241 • Avg. case performance: ⏱️ 11.28 s, 🔢 2697 tokens, 💵 $0.0088 in tokens

surveys

🔵 feature_flag_integration: 0.00%, ±0.00% (improvements: 0, regressions: 0)
🔵 feature_flag_understanding: 70.00%, ±0.00% (improvements: 0, regressions: 0)
🟢 first_question_type_correct: 40.00%, +10.00% (improvements: 1, regressions: 0)
🔴 survey_creation_basics: 85.00%, -5.00% (improvements: 0, regressions: 1)
🔴 survey_question_quality: 57.00%, -7.00% (improvements: 1, regressions: 1)
🔴 survey_relevance: 52.00%, -3.00% (improvements: 0, regressions: 1)

Baseline: master-1760616246 • Avg. case performance: ⏱️ 5.06 s, 🔢 5687 tokens, 💵 $0.0122 in tokens

trends

🟢 QueryKindSelection: 90.00%, +20.00% (improvements: 2, regressions: 0)
🟢 plan_correctness: 96.00%, +8.50% (improvements: 4, regressions: 2)
🟢 query_and_plan_alignment: 93.00%, +4.50% (improvements: 3, regressions: 3)
🟢 time_range_relevancy: 99.00%, +9.00% (improvements: 3, regressions: 0)

Baseline: master-1760616262 • Avg. case performance: ⏱️ 37.91 s, 🔢 11250 tokens, 💵 $0.0230 in tokens

ui_context_actions

🟢 ToolRelevance: 68.24%, +1.58% (improvements: 2, regressions: 0)

Baseline: master-1760616354 • Avg. case performance: ⏱️ 8.21 s, 🔢 0 tokens

ui_context_events

🔵 ToolRelevance: 66.42%, +0.81% (improvements: 3, regressions: 0)

Baseline: master-1760616360 • Avg. case performance: ⏱️ 8.59 s, 🔢 0 tokens

insights_addition

🔵 SemanticSimilarity: 78.95%, -0.67% (improvements: 0, regressions: 2)

Baseline: master-1760616367 • Avg. case performance: ⏱️ 9.22 s, 🔢 6035 tokens, 💵 $0.0028 in tokens

combined_rename_and_add

🔵 SemanticSimilarity: 89.00%, +0.04% (improvements: 1, regressions: 0)

Baseline: master-1760616411 • Avg. case performance: ⏱️ 6.15 s, 🔢 5739 tokens, 💵 $0.0025 in tokens

tool_generate_hogql_query

🔵 no_mustache: 100.00%, ±0.00% (improvements: 0, regressions: 0)
🔴 sql_semantics_correctness: 46.67%, -7.14% (improvements: 1, regressions: 2)
🔵 sql_syntax_correctness: 86.67%, ±0.00% (improvements: 0, regressions: 0)

Baseline: master-1760616421 • Avg. case performance: ⏱️ 9.93 s, 🔢 21989 tokens, 💵 $0.0447 in tokens

tool_filter_revenue_analytics

🔵 date_time_filtering_correctness: 100.00%, ±0.00% (improvements: 0, regressions: 0)
🔵 filter_generation_correctness: 100.00%, ±0.00% (improvements: 0, regressions: 0)

Baseline: master-1760616472 • Avg. case performance: ⏱️ 2.17 s, 🔢 4113 tokens, 💵 $0.0087 in tokens

tool_filter_revenue_analytics_ask_user_for_help

🔵 ask_user_for_help_scorer: 0.00%, ±0.00% (improvements: 0, regressions: 0)

Baseline: master-1760616480 • Avg. case performance: ⏱️ 2.02 s, 🔢 3577 tokens, 💵 $0.0075 in tokens

tool_search_session_recordings

🔴 date_time_filtering_correctness: 94.44%, -3.57% (improvements: 0, regressions: 1)
🔴 filter_generation_correctness: 96.03%, -2.04% (improvements: 1, regressions: 2)

Baseline: master-1760616520 • Avg. case performance: ⏱️ 7.20 s, 🔢 14987 tokens, 💵 $0.0308 in tokens

tool_search_session_recordings_ask_user_for_help

🔵 ask_user_for_help_scorer: 0.00%, ±0.00% (improvements: 0, regressions: 0)

Baseline: master-1760616536 • Avg. case performance: ⏱️ 13.60 s, 🔢 58365 tokens, 💵 $0.1180 in tokens

tool_search_session_recordings

🔵 date_time_filtering_correctness: 97.22%, ±0.00% (improvements: 0, regressions: 0)
🔵 filter_generation_correctness: 98.41%, +0.79% (improvements: 1, regressions: 0)

Baseline: master-1760616520 • Avg. case performance: ⏱️ 7.72 s, 🔢 13979 tokens, 💵 $0.0287 in tokens

tool_search_session_recordings_ask_user_for_help

🔵 ask_user_for_help_scorer: 0.00%, ±0.00% (improvements: 0, regressions: 0)

Baseline: master-1760616536 • Avg. case performance: ⏱️ 6.56 s, 🔢 27404 tokens, 💵 $0.0555 in tokens

filter_query_generation

🔵 SemanticSimilarity: 99.21%, ±0.00% (improvements: 0, regressions: 0)

Baseline: master-1760616545 • Avg. case performance: ⏱️ 0.54 s, 🔢 554 tokens, 💵 $0.0012 in tokens

yaml_fixing

🔵 ExactMatch: 83.33%, ±0.00% (improvements: 0, regressions: 0)

Baseline: master-1760616555 • Avg. case performance: ⏱️ 1.00 s, 🔢 168 tokens, 💵 $0.0001 in tokens

Triggered by this commit.

@skoob13 skoob13 merged commit bcbe71a into master Oct 16, 2025
187 of 188 checks passed
@skoob13 skoob13 deleted the feat/loop-executor-revamp-attempt-2 branch October 16, 2025 16:25

Labels

evals-ready Whether to run AI evals on this PR.

4 participants