@dvyukov
Collaborator

@dvyukov dvyukov commented Nov 17, 2025

add skeleton for code searching tool

Add a clang tool that is used for code indexing (tools/clang/codesearch/).
It follows the conventions and build procedure of the declextract tool.

Add pkg/codesearch package that aggregates the info exposed by the clang tools,
and allows doing simple queries:

  • show source code of an entity (function, struct, etc)
  • show entity comment
  • show all entities defined in a source file

Add tools/syz-codesearch wrapper tool that creates an index for a kernel build
and then runs code queries on it.
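
The three queries listed above can be pictured with a small in-memory index. This is only a sketch: the `Entity` and `Index` types, their fields, and the method names are made up for illustration and are not the actual pkg/codesearch API or its on-disk format.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Entity is a hypothetical index record (field names are assumptions).
type Entity struct {
	Kind      string // "function", "struct", ...
	Name      string
	File      string
	StartLine int // 1-based, inclusive
	EndLine   int // 1-based, inclusive
	Comment   string
}

// Index maps entity names to records and keeps file contents split into lines.
type Index struct {
	entities map[string]Entity
	sources  map[string][]string
}

func NewIndex(sources map[string]string, entities []Entity) *Index {
	idx := &Index{entities: map[string]Entity{}, sources: map[string][]string{}}
	for file, text := range sources {
		idx.sources[file] = strings.Split(text, "\n")
	}
	for _, e := range entities {
		idx.entities[e.Name] = e
	}
	return idx
}

// Source returns the source code of the named entity.
func (idx *Index) Source(name string) (string, error) {
	e, ok := idx.entities[name]
	if !ok {
		return "", fmt.Errorf("entity %q not found", name)
	}
	lines := idx.sources[e.File]
	if e.StartLine < 1 || e.StartLine > e.EndLine || e.EndLine > len(lines) {
		return "", fmt.Errorf("entity %q has a bad line range", name)
	}
	return strings.Join(lines[e.StartLine-1:e.EndLine], "\n"), nil
}

// Comment returns the comment recorded for the named entity.
func (idx *Index) Comment(name string) (string, error) {
	e, ok := idx.entities[name]
	if !ok {
		return "", fmt.Errorf("entity %q not found", name)
	}
	return e.Comment, nil
}

// FileEntities returns the sorted names of all entities defined in file.
func (idx *Index) FileEntities(file string) []string {
	var names []string
	for _, e := range idx.entities {
		if e.File == file {
			names = append(names, e.Name)
		}
	}
	sort.Strings(names)
	return names
}

// demoIndex builds a tiny index over one made-up source file.
func demoIndex() *Index {
	src := "/* Opens foo. */\nint foo_open(void)\n{\n\treturn 0;\n}\n\nstruct foo {\n\tint refs;\n};\n"
	return NewIndex(map[string]string{"drivers/foo.c": src}, []Entity{
		{Kind: "function", Name: "foo_open", File: "drivers/foo.c", StartLine: 2, EndLine: 5, Comment: "Opens foo."},
		{Kind: "struct", Name: "foo", File: "drivers/foo.c", StartLine: 7, EndLine: 9},
	})
}

func main() {
	idx := demoIndex()
	src, _ := idx.Source("foo_open")
	fmt.Println(src)
	fmt.Println(idx.FileEntities("drivers/foo.c"))
}
```

Keeping line ranges in the record means the "show source" query is a plain slice of the file, with no parsing at query time.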

@dvyukov dvyukov changed the title from "dvyukov codesearch tool" to "add codesearch tool" Nov 17, 2025
@dvyukov
Collaborator Author

dvyukov commented Nov 17, 2025

@sirdarckcat FYI This is something we will be able to reliably use in prod, maintain, extend and fix as necessary.

@tarasmadan
Collaborator

We have many tools and packages. A documentation page about which tools could be used for agentic games may help others sharpen the focus.

@dvyukov dvyukov force-pushed the dvyukov-codesearch-tool branch from 19ad222 to 97e8e12 Compare November 17, 2025 10:48
@dvyukov
Collaborator Author

dvyukov commented Nov 17, 2025

We have many tools and packages. A documentation page about which tools could be used for agentic games may help others sharpen the focus.

For now it's just this one. None of the existing tools were developed for agents.

@dvyukov
Collaborator Author

dvyukov commented Nov 17, 2025

@sirdarckcat FYI This is something we will be able to reliably use in prod, maintain, extend and fix as necessary.

Here is how it can be wired into aflow infra. It builds and caches the index as part of the workflow, and then uses the index to answer agent queries:

https://github.com/google/syzkaller/pull/6433/files#diff-703cd3d923312b4421c2d639cf4a72f8e56b918fdf9b3298d86313d8f6658138
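
As a rough picture of "build once, cache, then answer queries", here is a minimal sketch of an index cache keyed by kernel commit. The `Cache` type, the commit-based key, and the `build` callback are assumptions for illustration; the linked change may structure this differently.

```go
package main

import (
	"fmt"
	"sync"
)

// Index stands in for a built code index (hypothetical type).
type Index struct {
	Commit string
}

// Cache builds an index at most once per kernel commit and reuses it after that.
type Cache struct {
	mu      sync.Mutex
	indexes map[string]*Index
	build   func(commit string) (*Index, error)
}

func NewCache(build func(commit string) (*Index, error)) *Cache {
	return &Cache{indexes: map[string]*Index{}, build: build}
}

// Get returns the cached index for commit, building it on first use.
// The mutex is held during the build, so concurrent builds are serialized;
// that keeps the sketch simple at the cost of parallelism.
func (c *Cache) Get(commit string) (*Index, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if idx, ok := c.indexes[commit]; ok {
		return idx, nil
	}
	idx, err := c.build(commit)
	if err != nil {
		return nil, err // failed builds are not cached
	}
	c.indexes[commit] = idx
	return idx, nil
}

func main() {
	builds := 0
	cache := NewCache(func(commit string) (*Index, error) {
		builds++ // in a real workflow this would run the expensive indexing step
		return &Index{Commit: commit}, nil
	})
	cache.Get("abc123")
	cache.Get("abc123")
	fmt.Println("builds:", builds)
}
```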

Member

@sirdarckcat sirdarckcat left a comment

Super cool! This should help resolve the use case for code navigation.

For more complex queries (like AST and control flow), do you want to build on top of this as needed? Note that this may turn into a lot of work if it turns out to be necessary.

@dvyukov
Collaborator Author

dvyukov commented Nov 17, 2025

Super cool! This should help resolve the use case for code navigation.

For more complex queries (like AST and control flow), do you want to build on top of this as needed?

Yes.

Note that this may turn into a lot of work if it turns out to be necessary.

We already know how to do control and dataflow for interface inference:
https://github.com/google/syzkaller/blob/master/pkg/declextract/typing.go
If needed, we can reuse lots of that code and ideas.

@sirdarckcat
Member

If needed, we can reuse lots of that code and ideas.

cool, ok!

I think the hardest questions we will need to answer (and if we don't, then we need at least a query into git grep):

  1. tell me where functions of X type (as inferred by sizeof and castings) got allocated and freed (including following wrappers)
  2. tell me where accesses to this field are guarded by another field (refcnts, locks, etc)

@dvyukov dvyukov force-pushed the dvyukov-codesearch-tool branch from 97e8e12 to cd178f7 Compare November 17, 2025 13:22
@dvyukov
Collaborator Author

dvyukov commented Nov 17, 2025

tell me where functions of X type (as inferred by sizeof and castings) got allocated and freed (including following wrappers)

Amusingly, @melver just implemented this in clang for heap partitioning hardening.
@melver @a-nogikh can you point to relevant clang logic to borrow?

@dvyukov
Collaborator Author

dvyukov commented Nov 17, 2025

tell me where accesses to this field are guarded by another field (refcnts, locks, etc)

What would be the heuristic to match locks with protected fields?
I am thinking of a case where a very large piece of code (the transitive closure of the call graph, maybe including some very cold paths like swapping) runs within a critical section. It would be wrong to assume that all fields ever accessed in all that code are protected by the mutex.

If we also give wrong answers (both false positives and false negatives), it may confuse the LLM.
Perhaps it's better to expose simpler and more reliable primitives: where things/fields are referenced, call graph info, which functions do locking/unlocking, maybe names of locks, etc. Ideally, the LLM should be able to combine these primitives to answer more complex questions.
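
To make the concern concrete, here is a minimal sketch of two such primitives over a toy call graph: transitive reachability and "which functions take a lock directly". The `CallGraph` type and every function name in the example graph except `mutex_lock`/`mutex_unlock` are made up; nothing here is part of the tool in this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// CallGraph maps a function name to the functions it calls directly.
type CallGraph map[string][]string

// Reachable returns the sorted set of functions transitively callable from root.
func (g CallGraph) Reachable(root string) []string {
	seen := map[string]bool{}
	stack := []string{root}
	for len(stack) > 0 {
		fn := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		for _, callee := range g[fn] {
			if !seen[callee] {
				seen[callee] = true
				stack = append(stack, callee)
			}
		}
	}
	out := make([]string, 0, len(seen))
	for fn := range seen {
		out = append(out, fn)
	}
	sort.Strings(out)
	return out
}

// Lockers returns the sorted names of functions that directly call a lock function.
func (g CallGraph) Lockers(lockFns map[string]bool) []string {
	var out []string
	for fn, callees := range g {
		for _, callee := range callees {
			if lockFns[callee] {
				out = append(out, fn)
				break
			}
		}
	}
	sort.Strings(out)
	return out
}

// demoGraph is a made-up graph: an ioctl takes a mutex, then calls a helper
// that can end up in a cold reclaim path.
func demoGraph() CallGraph {
	return CallGraph{
		"foo_ioctl":    {"mutex_lock", "foo_update", "mutex_unlock"},
		"foo_update":   {"alloc_helper"},
		"alloc_helper": {"reclaim_cold_path"},
	}
}

func main() {
	g := demoGraph()
	fmt.Println(g.Lockers(map[string]bool{"mutex_lock": true}))
	fmt.Println(g.Reachable("foo_ioctl"))
}
```

In this toy graph the closure of `foo_ioctl` includes `reclaim_cold_path`, so a heuristic of the form "everything reachable under the lock is protected by it" would attribute that path's field accesses to the mutex. The primitives themselves stay exact; the interpretation is left to whoever combines them.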

@dvyukov dvyukov force-pushed the dvyukov-codesearch-tool branch from cd178f7 to 4de58a2 Compare November 17, 2025 13:37
@sirdarckcat
Member

If we also give wrong answers (both false positives and false negatives), it may confuse the LLM.

We would only give this tool to the LLM when it is already debugging a potential lock issue, for example.

That said, that feels like a problem for later (as long as we have enough information to answer these questions, whether we do it in a function or let the LLM do it manually is an implementation detail we can decide based on experimentation)

@melver
Collaborator

melver commented Nov 17, 2025

tell me where functions of X type (as inferred by sizeof and castings) got allocated and freed (including following wrappers)

Amusingly, @melver just implemented this in clang for heap partitioning hardening. @melver @a-nogikh can you point to relevant clang logic to borrow?

If you use latest LLVM repo you can just use infer_alloc::inferPossibleType: https://github.com/llvm/llvm-project/blob/29e7b4f9a72576a2901407834b988ec37f931d28/clang/include/clang/AST/InferAlloc.h#L25

Edit: As-is, it finds sizeof expressions, but it doesn't automatically figure out the casts; it expects the caller to keep track of the innermost cast and pass it in.
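
The division of labor described here (type from a sizeof expression, cast type supplied by the caller) can be mimicked on plain strings. This is a toy sketch, not the clang API: the real function works on AST nodes, `inferAllocType` is a hypothetical name, and the "sizeof first, cast as fallback" order is an assumption made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// sizeofRe matches sizeof(T) where T is a plain, struct, or union type name.
// Expressions such as sizeof(*p) intentionally do not match.
var sizeofRe = regexp.MustCompile(`sizeof\s*\(\s*((?:struct\s+|union\s+)?\w+)\s*\)`)

// inferAllocType guesses the allocated type for an allocation call.
// sizeArg is the text of the size argument; castType is the innermost cast
// applied to the call result, tracked by the caller ("" if there is none).
func inferAllocType(sizeArg, castType string) (string, bool) {
	if m := sizeofRe.FindStringSubmatch(sizeArg); m != nil {
		return m[1], true
	}
	if castType != "" {
		return castType, true
	}
	return "", false
}

func main() {
	fmt.Println(inferAllocType("sizeof(struct foo)", ""))
	fmt.Println(inferAllocType("n * sizeof(struct foo)", ""))
	fmt.Println(inferAllocType("len", "struct bar"))
	fmt.Println(inferAllocType("len", ""))
}
```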

@melver
Collaborator

melver commented Nov 17, 2025

tell me where functions of X type (as inferred by sizeof and castings) got allocated and freed (including following wrappers)

Amusingly, @melver just implemented this in clang for heap partitioning hardening. @melver @a-nogikh can you point to relevant clang logic to borrow?

More generally, beware of the issues of how kmalloc is designed (macro -> macro -> always_inline -> outline slab function). That is also apparent from the patch I built for the kernel heap partitioning prototype: https://lore.kernel.org/all/20250825154505.1558444-1-elver@google.com/

I suspect that if you build a special purpose kernel-only tool, that can be solved and the right call expression can be fed into infer_alloc::inferPossibleType.

@dvyukov
Collaborator Author

dvyukov commented Nov 19, 2025

ping @a-nogikh @tarasmadan @pimyn-girgis

a-nogikh previously approved these changes Nov 19, 2025
Let tools verify that all source file names, line numbers, etc
are valid/present. If there are any bogus entries, it's better
to detect them early, than to crash/error much later when the
info is used.
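
A minimal sketch of that kind of early check, assuming a hypothetical `Entry` record and a precomputed map from file name to line count (the actual checks in the commit may differ):

```go
package main

import "fmt"

// Entry is a hypothetical index record emitted by an indexing tool.
type Entry struct {
	Name      string
	File      string
	StartLine int // 1-based, inclusive
	EndLine   int // 1-based, inclusive
}

// Validate reports entries with a missing name, an unknown file, or a line
// range that does not fit inside the file. lineCounts maps each known source
// file to its number of lines.
func Validate(entries []Entry, lineCounts map[string]int) []string {
	var problems []string
	for _, e := range entries {
		n, ok := lineCounts[e.File]
		switch {
		case e.Name == "":
			problems = append(problems, fmt.Sprintf("entry in %q has no name", e.File))
		case !ok:
			problems = append(problems, fmt.Sprintf("%s: unknown file %q", e.Name, e.File))
		case e.StartLine < 1 || e.StartLine > e.EndLine || e.EndLine > n:
			problems = append(problems, fmt.Sprintf("%s: bad line range %d-%d in %q (%d lines)",
				e.Name, e.StartLine, e.EndLine, e.File, n))
		}
	}
	return problems
}

func main() {
	lineCounts := map[string]int{"drivers/foo.c": 9}
	entries := []Entry{
		{Name: "foo_open", File: "drivers/foo.c", StartLine: 2, EndLine: 5},
		{Name: "foo_close", File: "drivers/foo.c", StartLine: 8, EndLine: 12},
		{Name: "bar_init", File: "drivers/bar.c", StartLine: 1, EndLine: 3},
	}
	for _, p := range Validate(entries, lineCounts) {
		fmt.Println(p)
	}
}
```

Running this once when the index is loaded turns a late crash in a query into an immediate, specific error.
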
@dvyukov dvyukov added this pull request to the merge queue Nov 20, 2025
Merged via the queue into google:master with commit 280ea30 Nov 20, 2025
18 checks passed
@dvyukov dvyukov added the AI patching Feature requests and bugs related to AI-based kernel bug fix generation. label Jan 15, 2026