Parallelizing API calls in compare_job_performance, adding troubleshooting steps for timeouts on large applications #97
Conversation
Force-pushed from bd181d2 to 50632c7
src/spark_history_mcp/utils/utils.py
Outdated
elif (
    "Request failed" in error_text and "allexecutors" in error_text
):
    error_msg = f"{name} failed: Spark History Server likely out of memory - try: export SPARK_DAEMON_MEMORY=4g"
elif "Request failed" in error_text:
    error_msg = f"{name} failed: Spark History Server error - may need more memory (export SPARK_DAEMON_MEMORY=4g)"
Are these correct? "Request failed" could mean a lot of things, even with the text "allexecutors". Maybe a better approach is to handle it in the else branch and surface the error message? Also, SPARK_DAEMON_MEMORY=4g may not be a useful message for an LLM.
I think it's too specific and the first case should be sufficient; will remove.
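For illustration, a minimal sketch of the more generic handling discussed here, assuming the `name` and `error_text` values from the snippet above; `summarize_api_error` is a hypothetical helper, not the function in this PR:

```python
def summarize_api_error(name: str, error_text: str) -> str:
    """Hypothetical sketch: surface the original error text instead of
    guessing at a root cause from substring matches."""
    if "OutOfMemoryError" in error_text:
        # Keep a memory hint only when the error clearly points at it
        return (
            f"{name} failed: Spark History Server appears to be out of memory "
            "(increasing SPARK_DAEMON_MEMORY may help)"
        )
    # Generic fallback: pass the raw error through so the caller (or an LLM)
    # can reason about the actual failure
    return f"{name} failed: {error_text}"
```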
src/spark_history_mcp/tools/tools.py
Outdated
    }
except Exception:
    return {
        "error": "Spark History Server unresponsive - likely out of memory. Set SPARK_DAEMON_MEMORY=4g and restart."
It can fail for any reason here, so I would rather not indicate OOM.
good point, will make it more generic
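A minimal sketch of what the more generic handling could look like; `run_tool_safely` and its hint text are illustrative, not the code in this PR:

```python
def run_tool_safely(work) -> dict:
    """Hypothetical wrapper: report the actual exception rather than
    assuming the History Server is out of memory."""
    try:
        return {"result": work()}
    except Exception as e:  # tool boundary: surface the error and return
        return {
            "error": f"Spark History Server request failed: {e}",
            "hint": "For very large applications, memory or timeout settings may need tuning.",
        }
```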
execution_result = parallel_execute(
    api_calls,
    max_workers=6,
    timeout=300,  # Apply generous timeout for large scale Spark applications
5 min enough for most use cases?
I have tested with two Spark event logs (attached in the description). The largest one had 1,765 executors and processed 20.7 TB of input data; the Spark event log itself was around 7.37 GB. 5 minutes was more than enough for this use case, but I haven't tested with data larger than this.
I think 5 min is a good starting point
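For readers following along, a rough sketch of how a parallel_execute-style helper could be built on concurrent.futures; the signature shown (a mapping of name to zero-argument callable) is an assumption, and the real helper in this PR may differ:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parallel_execute(api_calls, max_workers=6, timeout=300):
    """Hypothetical sketch: run independent History Server API calls concurrently."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(call): name for name, call in api_calls.items()}
        # as_completed raises TimeoutError if the batch exceeds `timeout` seconds
        for future in as_completed(futures, timeout=timeout):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as e:  # record per-call failures instead of aborting the batch
                errors[name] = str(e)
    return {"results": results, "errors": errors}
```

Capturing per-call errors keeps one slow or failing endpoint from discarding the results of the others.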
**Note:** SHS_SERVERS_<Replace with server name in config.yaml>_TIMEOUT

### 3: JVM Heap Exhaustion
Another way to deal with this is to use the hybrid store.
thanks for the suggestion! let me test this and add it as a troubleshooting step if it works.
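For reference, an untested sketch of what enabling the hybrid store might look like in the History Server's spark-defaults.conf; the values are placeholders and the exact settings should be checked against the Spark version in use:

```properties
# Illustrative hybrid store settings for the Spark History Server
spark.history.store.hybridStore.enabled          true
spark.history.store.hybridStore.maxMemoryUsage   2g
# Local directory backing the on-disk store (placeholder path)
spark.history.store.path                         /var/spark/history-store
```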
Force-pushed from 50632c7 to 22024d4
…oting steps for timeouts on large applications Signed-off-by: Andrew Kim <[email protected]>
Force-pushed from 22024d4 to f37c835
nabuskey left a comment:
LGTM
Signed-off-by: Andrew Kim [email protected]
🔄 Pull Request
📝 Description
Adding optimization and debugging steps for timeout scenarios when querying large Spark applications. This PR applies the optimization to the compare_job_performance tool, the most complex tool we have.
Solution:
🎯 Type of Change
🧪 Testing
Tests run with task test.
🔬 Test Commands Run
can you compare job performance between application_a and application_b?
Revision 2
Using the hybrid store approach to resolve the JVM issue, I was able to get a result for compare_job_performance after 236.157s on Q CLI.
✅ Checklist
🎉 Thank you for contributing! Your effort helps make Spark monitoring more intelligent.