Archived entries from file /home/mgalgs/src/makepkg-checkwrapper/TODO.org
> Let's add a new flag --audit=agentic, and give the LLM some tool calls: - listdir - readfile In this mode, we'll instruct the LLM that it's performing a security audit and tell it what tools it has available and ask it to go to town. It still needs to produce a report with the xml tags for us to parse. We should always give it the intial directory listing since presumably it's always going to want to at least do one directory listing at the top-level. Let's just provide that up front. This was inspired by the following session, where I'm trying to make sure that this tool would have actually caught the recent google-chrome-stable vulnerability. ``` > ./aur-sleuth --audit=sources google-chrome-stable Created temporary directory: /tmp/aur-sleuth-rmn9jrhe Cloning https://aur.archlinux.org/google-chrome-stable.git... Running makepkg --nobuild to download sources... --- Auditing Source Files --- [AUDIT] Checking source file: /tmp/aur-sleuth-rmn9jrhe/src/eula_text.html (53070 bytes) ERROR: Could not audit source file /tmp/aur-sleuth-rmn9jrhe/src/eula_text.html: mismatched tag: line 24, column 2 AUDIT FAILED. See reasons above. Cleaning up temporary directory: /tmp/aur-sleuth-rmn9jrhe ``` Hopefully the agent would be smart enough to not even bother "auditing" the eula_text.html, but would instead hone in on the shell script, which contains the malicious curl command. Standard agentic loop, keep it clean.It should actually encapsulate the LLM calls. The OpenAI client will live inside the session class. That way it can track sizes internally without callers having to track sizes.
XML seems to be processed better by LLMs.
Just use python logging framework and send it to the tmp debug log file.Currently we only flush the logs at the end, but sometimes things get hung before that. We should configure `logging` to get logs right away.
Currently we’re writing quite a bit to stdout. Instead of sending the full audit to stdout we should write it to a log file (different than the debug log file, which is already quite noisy), and provide a rich, TUI display to the user.We’ll assume that users have a modern terminal installed with plenty of support for everything we need to make a responsive terminal user interface.
Here’s how I imagine the output:
“` ./aur-sleuth –audit=agentic google-chrome-stable
,----
| {spinner} Analyzing google-chrome-stable |
`---- [{Status details}] “`
(but with a full box, this is just a mockup)
And rather than pushing stdout, we would refresh the content inside the box with current status:
“` ,----
| {spinner} Auditing PKGBUILD… |
`---- [0 issues found, 2 files left to process] “`
And when a single “box action” is completed we add some newlines to “finalize” that box and push it up.
(imagine an issue was found there and then we move on to install.sh)
“` ,----
| X PKGBUILD Audit failure |
| {3 sentence description of failure} |
`----
,----
| {spinner} Auditing install.sh… |
`---- [1 issue found, 1 file left to process] “`
“` ,----
| Audit complete! Result: FAIL |
|---|
| Issues found: |
| - {3 sentence description of first issue} |
| Full audit report can be found in /tmp/aur-sleuth-report.txt |
`---- “`
A success would just be:
“` ,----
| Audit complete! Result: SUCCESS |
|---|
| Full audit report can be found in /tmp/aur-sleuth-report.txt |
`---- “`
Agent can decide which files (possibly all) to readIn order to maintain its own state, I think the agent is going to need a “WriteFile” tool to keep a checklist of files it needs to review. Or do you think it will be able to keep it in its context? I worry that it will forget since it could be reading some huge files, so the history of files it has already read is going to fall out of context. It’s almost like during the code review portion it would be a “recursive” agentic LLM call, don’t need the full audit conversation history to perform a code review of a single file. And the audit conversation history only needs to record the outcome of the review in its context, not the code listing or the code review / audit report. What’s the canonical/proper way to handle this in an agentic LLM loop?
Create a wrapper class for our “client” instance (OpenAI). This will contain an OpenAI instance, and also take care of aggregating all costs and token usage throughout the whole session.For OpenRouter we can use their API to dynamically retrieve up-to-date pricing info:
> curl -s https://openrouter.ai/api/v1/models | jq ‘.data[] | select(.id == “qwen/qwen3-coder”) | .pricing’ { “prompt”: “0.0000002”, “completion”: “0.0000008”, “request”: “0”, “image”: “0”, “audio”: “0”, “web_search”: “0”, “internal_reasoning”: “0” }
Our system and users prompts are currently both vulnerable to prompt injection since they’re using unsanitized user inputs. Please fix.