Sift tools #74
Conversation
I agree with removing CreateInvestigation, in fact should we restructure this to have a tool to
Force-pushed from 264870e to c5235db
Cool, I removed the tool. On the other comment, see here.
README.md
Outdated
@@ -72,6 +79,12 @@ This is useful if you don't use certain functionality or if you don't want to ta
| `get_current_oncall_users` | OnCall | Get users currently on-call for a specific schedule |
| `list_oncall_teams` | OnCall | List teams from Grafana OnCall |
| `list_oncall_users` | OnCall | List users from Grafana OnCall |
| `create_investigation` | Sift | Create a new Sift investigation to analyze data from different datasources |
nit: need a docs update.
README.md
Outdated
| `get_investigation` | Sift | Retrieve an existing Sift investigation by its UUID |
| `get_analysis` | Sift | Retrieve a specific analysis from a Sift investigation |
| `list_investigations` | Sift | Retrieve a list of Sift investigations with an optional limit |
For `get_investigation`, `get_analysis` and `list_investigations` - I'm not fully convinced we need to expose that level of detail to the LLM. I'd like to see examples of where we would need this exposed to the LLM, and whether we can build some more abstract tools that can help.

For example: right now, if I wanted to find an investigation with a given labelset, say `{cluster='prod-us-central-0', namespace='machine-learning'}`, the LLM would need to fire three queries - `list_investigations` -> `get_investigation` -> `get_analysis` (x3) - to retrieve all the data. This will require a lot of tokens in LLM calls when we send the entire list of investigations to it. Instead, can we build a higher abstraction that can take in a labelset and directly retrieve the analyses for it without burning through as many tokens?

Curious if there's a use-case I'm missing or any other actions that can be enabled by the current API.
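To make the suggestion concrete, a tool along these lines could collapse the three calls into one. This is only a sketch: the tool name, the `siftClientFromContext` helper, and the client methods (`listInvestigations`, `getAnalyses`) are hypothetical, and it assumes the investigation's request data carries the labelset.

```go
// Hypothetical higher-level tool: take a labelset, filter investigations
// server-side, and return only their analyses, so the LLM makes one call
// instead of list -> get -> get_analysis.
type FindAnalysesByLabelsParams struct {
	Labels map[string]string `json:"labels" jsonschema:"required,description=Label set that investigations must match"`
}

func findAnalysesByLabels(ctx context.Context, args FindAnalysesByLabelsParams) ([]Analysis, error) {
	client := siftClientFromContext(ctx) // hypothetical helper
	investigations, err := client.listInvestigations(ctx)
	if err != nil {
		return nil, fmt.Errorf("listing investigations: %w", err)
	}

	var analyses []Analysis
	for _, inv := range investigations {
		if !labelsMatch(inv.RequestData.Labels, args.Labels) {
			continue
		}
		a, err := client.getAnalyses(ctx, inv.ID) // hypothetical method
		if err != nil {
			return nil, fmt.Errorf("getting analyses for %s: %w", inv.ID, err)
		}
		analyses = append(analyses, a...)
	}
	return analyses, nil
}

// labelsMatch reports whether every wanted label is present with the same value.
func labelsMatch(got, want map[string]string) bool {
	for k, v := range want {
		if got[k] != v {
			return false
		}
	}
	return true
}

var FindAnalysesByLabels = mcpgrafana.MustTool(
	"find_analyses_by_labels",
	"Retrieve analyses from investigations whose labels match the given labelset.",
	findAnalysesByLabels,
)
```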
Yeah, that's true. Adding the tools there mainly saves time for the user in case there is anything already running in Sift:

There is an alert in namespace x --> ask the assistant if there are any investigations for this namespace --> the `list_investigations` tool is used --> all recent investigations are listed.

If the user wants to grab data, they can then request it using the ID provided.

That's the main motivation behind exposing the tools. I do agree, though, that we may need more targeted tools. Let's discuss in the MCP call later.
Left a couple of comments inline. I'm not against keeping the get/list Sift tools in but I think the other high-level tools should avoid mentioning Sift.
tools/sift.go
Outdated
// RunErrorPatternLogs is a tool for running an ErrorPatternLogs check
var RunErrorPatternLogs = mcpgrafana.MustTool(
	"run_error_pattern_logs",
	"Creates a Sift investigation with an ErrorPatternLogs check, waits for it to complete, and returns the analysis results. This tool triggers an investigation with the ErrorPatternLogs check in the relevant Loki datasource. It investigates whether there are elevated error rates in the logs compared to the last day's average and returns the error pattern found, if any.",
	runErrorPatternLogs,
)
My gut feeling is that these tools should be a bit less Sift-y in their names and descriptions. For example, this one could be named 'find_error_log_patterns', and the description would be closer to just the last sentence of the existing description, since an LLM is unlikely to know what Sift is or what an `ErrorPatternLogs` check means.
Yeah that's true. I will refactor it a bit.
Removed the "sift" mentions from error pattern logs and slow requests check.
Sift mentions in general I think are good, especially if we go further and add another layer to MCP explaining what Sift is and how to use it (a prompt, maybe); same with OnCall etc. I haven't removed them from the rest of the tools as they are very specific (list investigations, get investigations, etc.). Open to discussing this though; maybe we can completely remove Sift mentions for now 🙄
> I haven't removed them from the rest of the tools as they are very specific (list investigations, get investigations, etc.). Open to discussing this though; maybe we can completely remove Sift mentions for now 🙄
Yeah I think that's fine 🙂 I would even consider changing those ones to 'List Sift investigations' so it matches the docs & UI and is clearly distinct from any informal kind of 'investigation'!
tools/sift.go
Outdated
// AnalysisStep represents a single step in the analysis process.
type AnalysisStep struct {
	CreatedAt time.Time `json:"created" validate:"isdefault"`
	// State that the Analysis is entering.
	State string `json:"state"`
	// The exit message of the step. Can be empty if the step was successful.
	ExitMessage string `json:"exitMessage"`
	// Runtime statistics for this step
	Stats map[string]interface{} `json:"stats,omitempty"`
}

type AnalysisEvent struct {
	StartTime   time.Time              `json:"startTime"`
	EndTime     time.Time              `json:"endTime"`
	Name        string                 `json:"name"`
	Description string                 `json:"description,omitempty"`
	Details     map[string]interface{} `json:"details"`
}

// Interesting: The analysis complete with results that indicate a probable cause for failure.
type AnalysisResult struct {
	Successful      bool                   `json:"successful"`
	Interesting     bool                   `json:"interesting"`
	Message         string                 `json:"message"`
	MarkdownSummary string                 `json:"-" gorm:"-"`
	Details         map[string]interface{} `json:"details"`
	Events          []AnalysisEvent        `json:"events,omitempty" gorm:"serializer:json"`
}

// An Analysis struct provides the status and results
// of running a specific type of check.
type Analysis struct {
	ID        uuid.UUID `json:"id" gorm:"primarykey;type:char(36)" validate:"isdefault"`
	CreatedAt time.Time `json:"created" validate:"isdefault"`
	UpdatedAt time.Time `json:"modified" validate:"isdefault"`

	Status    AnalysisStatus `json:"status" gorm:"default:pending;index:idx_analyses_stats,priority:100"`
	StartedAt *time.Time     `json:"started" validate:"isdefault"`

	// Foreign key to the Investigation that created this Analysis.
	InvestigationID uuid.UUID `json:"investigationId" gorm:"index:idx_analyses_stats,priority:10"`

	// Name is the name of the check that this analysis represents.
	Name   string         `json:"name"`
	Title  string         `json:"title"`
	Steps  []AnalysisStep `json:"steps" gorm:"foreignKey:AnalysisID;constraint:OnDelete:CASCADE"`
	Result AnalysisResult `json:"result" gorm:"embedded;embeddedPrefix:result_"`
}

type DatasourceConfig struct {
	LokiDatasource       DatasourceInfo `json:"lokiDatasource" gorm:"not null;embedded;embeddedPrefix:loki_"`
	PrometheusDatasource DatasourceInfo `json:"prometheusDatasource" gorm:"not null;embedded;embeddedPrefix:prometheus_"`
	TempoDatasource      DatasourceInfo `json:"tempoDatasource" gorm:"not null;embedded;embeddedPrefix:tempo_"`
	PyroscopeDatasource  DatasourceInfo `json:"pyroscopeDatasource" gorm:"not null;embedded;embeddedPrefix:pyroscope_"`
}

type DatasourceInfo struct {
	Uid string `json:"uid"`
}

// AnalysisMeta represents metadata about the analyses
type AnalysisMeta struct {
	CountsByStage map[string]interface{} `json:"countsByStage"`
	Items         []Analysis             `json:"items"`
}

type Investigation struct {
	ID        uuid.UUID `json:"id" gorm:"primarykey;type:char(36)" validate:"isdefault"`
	CreatedAt time.Time `json:"created" gorm:"index" validate:"isdefault"`
	UpdatedAt time.Time `json:"modified" validate:"isdefault"`

	TenantID string `json:"tenantId" gorm:"index;not null;size:256"`

	Datasources DatasourceConfig `json:"datasources" gorm:"embedded;embeddedPrefix:datasources_"`

	Name        string               `json:"name"`
	RequestData InvestigationRequest `json:"requestData" gorm:"not null;embedded;embeddedPrefix:request_"`

	// TODO: Add this when we want to extract discovered inputs for later usage
	// Inputs Inputs `json:"inputs" gorm:"serializer:json"`

	// GrafanaURL is the Grafana URL to be used for datasource queries
	// for this investigation.
	GrafanaURL string `json:"grafanaUrl"`

	// Status describes the state of the investigation (pending, running, failed, or finished).
	Status InvestigationStatus `json:"status"`

	// FailureReason is a short human-friendly string that explains the reason that the
	// investigation failed.
	FailureReason string `json:"failureReason,omitempty"`

	// Analyses contains metadata about the investigation's analyses
	Analyses AnalysisMeta `json:"analyses"`
}

type RequestData struct {
	Labels map[string]string `json:"labels"`
	Checks []string          `json:"checks"`
}
I think some of these types, annotations etc. are redundant here (e.g. the `gorm` tags aren't needed and some of the types are unused). Would be good to keep as many internals out of here as possible.
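For illustration, a trimmed version might keep only the JSON tags the MCP server needs to decode Sift API responses. This is a sketch of the idea, not the PR's final shape.

```go
// Slimmed-down response types: gorm/validate tags dropped, keeping only
// what the tools need to surface results to the client.
type AnalysisResult struct {
	Successful  bool                   `json:"successful"`
	Interesting bool                   `json:"interesting"`
	Message     string                 `json:"message"`
	Details     map[string]interface{} `json:"details"`
	Events      []AnalysisEvent        `json:"events,omitempty"`
}

type Analysis struct {
	ID              uuid.UUID      `json:"id"`
	CreatedAt       time.Time      `json:"created"`
	UpdatedAt       time.Time      `json:"modified"`
	Status          AnalysisStatus `json:"status"`
	InvestigationID uuid.UUID      `json:"investigationId"`
	Name            string         `json:"name"`
	Title           string         `json:"title"`
	Steps           []AnalysisStep `json:"steps"`
	Result          AnalysisResult `json:"result"`
}
```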
A few comments on naming / nits on exporting inline.
I'm trying to avoid exporting anything other than the actual tool definitions & handlers and the `AddXTools` functions where possible!
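For reference, the registration pattern being described looks roughly like the sketch below; the exported tool names and the `Register`/`*server.MCPServer` signature are assumptions based on the other tool files, not this PR's exact code.

```go
// AddSiftTools registers the Sift tools with the MCP server. Everything
// else in tools/sift.go can stay unexported.
func AddSiftTools(mcp *server.MCPServer) {
	FindErrorPatternLogs.Register(mcp)
	FindSlowRequests.Register(mcp)
	GetInvestigation.Register(mcp)
	GetAnalysis.Register(mcp)
	ListInvestigations.Register(mcp)
}
```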
tools/sift.go
Outdated
// FindErrorPatternLogs is a tool for running an ErrorPatternLogs check
var FindErrorPatternLogs = mcpgrafana.MustTool(
	"find_error_pattern_logs",
	"Creates an investigation to search for error patterns in logs, waits for it to complete, and returns the analysis results. This tool triggers an investigation in the relevant Loki datasource to determine if there are elevated error rates compared to the last day's average, and returns the error pattern found, if any.",
Shall we get rid of mentions of an 'investigation' here too, so it looks like the tool just opaquely finds error patterns in logs? (the implementation isn't super important imo)
Same with slow requests
Force-pushed from cc612e2 to db86474
tools/sift.go
Outdated
// FindErrorPatternLogsParams defines the parameters for running an ErrorPatternLogs check
type FindErrorPatternLogsParams struct {
	Name string `json:"name" jsonschema:"required,description=The name of the investigation"`
Name string `json:"name" jsonschema:"required,description=The name of the investigation"`
I wonder if we can omit this and generate something, or if it's better to just let the model provide some random name?
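One possible direction (just a sketch, not what the PR settles on): make the field optional and generate a fallback name in the handler.

```go
// Sketch: optional name with a generated fallback. The schema description
// is illustrative only.
type FindErrorPatternLogsParams struct {
	Name string `json:"name,omitempty" jsonschema:"description=Optional name for the investigation; generated if omitted"`
}

// investigationName returns the provided name, or a timestamped default
// that is still easy to spot in the Sift UI.
func investigationName(provided string) string {
	if provided != "" {
		return provided
	}
	return fmt.Sprintf("ErrorPatternLogs %s", time.Now().UTC().Format(time.RFC3339))
}
```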
Yeah it'd definitely be nice to include some info on origin/user here eventually. This is good for now though!
Co-authored-by: Ben Sully <[email protected]>
LGTM! Nice one
Adding Sift tools for
There is also a cloud test added that checks the get tools for Sift.
Implementation Approach:
I initially created separate tools for basic Sift API operations (create/get investigations) to reduce code duplication. However, for specific checks like ErrorPatternLogs and SlowRequests, I created dedicated tools that handle the entire workflow internally.
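Roughly, "handles the entire workflow internally" means something like the sketch below. The `siftClient` methods named here (`createInvestigation`, `getInvestigation`, `getAnalyses`) are assumed stand-ins for the actual client code.

```go
// Sketch of the dedicated-tool workflow: create the investigation, poll it,
// then return only the analyses. Client method names are assumptions.
func runErrorPatternLogsWorkflow(ctx context.Context, client *siftClient, params FindErrorPatternLogsParams) ([]Analysis, error) {
	// 1. Create an investigation that runs only the ErrorPatternLogs check.
	inv, err := client.createInvestigation(ctx, params.Name, []string{"ErrorPatternLogs"})
	if err != nil {
		return nil, fmt.Errorf("creating investigation: %w", err)
	}

	// 2. Poll until the investigation finishes or fails.
	for inv.Status != "finished" && inv.Status != "failed" {
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(5 * time.Second):
		}
		if inv, err = client.getInvestigation(ctx, inv.ID); err != nil {
			return nil, fmt.Errorf("polling investigation: %w", err)
		}
	}

	// 3. Return the analyses so the tool response carries only the results.
	return client.getAnalyses(ctx, inv.ID)
}
```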
Current Workflow Example:
When running ErrorPatternLogs in Sift, the process involves:
This can be done through:
the `run_error_pattern_logs` tool (which makes three API calls internally)

Questions for Discussion:
Should we provide:
I'm leaning toward removing the createInvestigation tool and keeping just the specialized Sift workflow tools, but I'm open to suggestions.
Note: I've also refactored cloud tests into a utils file for better organization.