add codesearch tool #6458

dvyukov · 2025-11-17T10:37:50Z

add skeleton for code searching tool

Add a clang tool that is used for code indexing (tools/clang/codesearch/).
It follows the conventions and build procedure of the declextract tool.

Add pkg/codesearch package that aggregates the info exposed by the clang tools,
and allows doing simple queries:

- show source code of an entity (function, struct, etc)
- show entity comment
- show all entities defined in a source file

Add tools/syz-codesearch wrapper tool that allows creating an index for a kernel build,
and then running code queries on it.
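
To make the three queries concrete, here is a minimal Go sketch of what such an index could look like. The `Entity` and `Index` types, their fields, and the method names are assumptions made for illustration; they are not the actual pkg/codesearch API.

```go
package main

import (
	"fmt"
	"sort"
)

// Entity is a hypothetical index record. The field names are assumptions
// made for this sketch, not the actual pkg/codesearch schema.
type Entity struct {
	Name    string // entity name, e.g. "foo_open"
	Kind    string // "function", "struct", etc.
	File    string // defining source file, relative to the kernel tree
	Comment string // comment attached to the definition, if any
	Body    string // source text of the definition
}

// Index supports the three simple queries from the description.
// It keys entities by name only, so same-named static functions in
// different files would collide; a real index has to handle that.
type Index struct {
	byName map[string]*Entity
	byFile map[string][]*Entity
}

// NewIndex aggregates the entities emitted by the (assumed) clang tool.
func NewIndex(entities []*Entity) *Index {
	idx := &Index{
		byName: make(map[string]*Entity),
		byFile: make(map[string][]*Entity),
	}
	for _, e := range entities {
		idx.byName[e.Name] = e
		idx.byFile[e.File] = append(idx.byFile[e.File], e)
	}
	return idx
}

// Source returns the source code of the named entity.
func (idx *Index) Source(name string) (string, error) {
	e, ok := idx.byName[name]
	if !ok {
		return "", fmt.Errorf("entity %q is not in the index", name)
	}
	return e.Body, nil
}

// Comment returns the comment of the named entity.
func (idx *Index) Comment(name string) (string, error) {
	e, ok := idx.byName[name]
	if !ok {
		return "", fmt.Errorf("entity %q is not in the index", name)
	}
	return e.Comment, nil
}

// FileEntities returns the sorted names of all entities defined in file.
func (idx *Index) FileEntities(file string) []string {
	var names []string
	for _, e := range idx.byFile[file] {
		names = append(names, e.Name)
	}
	sort.Strings(names)
	return names
}

func main() {
	idx := NewIndex([]*Entity{
		{Name: "foo_open", Kind: "function", File: "drivers/foo.c",
			Comment: "// foo_open opens the device.",
			Body:    "int foo_open(void) { return 0; }"},
		{Name: "foo_dev", Kind: "struct", File: "drivers/foo.c",
			Body: "struct foo_dev { int id; };"},
	})
	src, _ := idx.Source("foo_open")
	fmt.Println(src)
	fmt.Println(idx.FileEntities("drivers/foo.c"))
}
```

Unknown names return an error rather than an empty string, so a caller (or an agent) can tell "no such entity" apart from "entity with an empty comment".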

dvyukov · 2025-11-17T10:40:37Z

@sirdarckcat FYI This is something we will be able to reliably use in prod, maintain, extend and fix as necessary.

tarasmadan · 2025-11-17T10:47:08Z

We have many tools and packages. A documentation page about which tools can be used for agentic games may help others sharpen their focus.

dvyukov · 2025-11-17T10:49:32Z

> We have many tools and packages. A documentation page about which tools can be used for agentic games may help others sharpen their focus.

For now it's just this one. None of the existing tools were developed for agents.

dvyukov · 2025-11-17T10:50:55Z

> @sirdarckcat FYI This is something we will be able to reliably use in prod, maintain, extend and fix as necessary.

Here is how it can be wired into aflow infra. It builds and caches the index as part of the workflow, and then uses the index to answer agent queries:

https://github.com/google/syzkaller/pull/6433/files#diff-703cd3d923312b4421c2d639cf4a72f8e56b918fdf9b3298d86313d8f6658138

sirdarckcat

Super cool! This should help resolve the use case for code navigation.

For more complex queries (like AST and control flow), do you want to build on top of this as needed? Note that this may turn into a lot of work if it turns out to be necessary.

pkg/codesearch/codesearch.go

dvyukov · 2025-11-17T12:05:30Z

> Super cool! This should help resolve the use case for code navigation.

> For more complex queries (like AST and control flow), do you want to build on top of this as needed?

Yes.

> Note that this may turn into a lot of work if it turns out to be necessary.

We already know how to do control and dataflow for interface inference:
https://github.com/google/syzkaller/blob/master/pkg/declextract/typing.go
If needed, we can reuse lots of that code and ideas.

sirdarckcat · 2025-11-17T12:24:55Z

> If needed, we can reuse lots of that code and ideas.

cool, ok!

I think the hardest questions we will need to answer (and if we don't, then we need at least a query into git grep):

- tell me where functions of X type (as inferred by sizeof and castings) got allocated and freed (including following wrappers)
- tell me where accesses to this field are guarded by another field (refcnts, locks, etc)

dvyukov · 2025-11-17T13:30:19Z

> tell me where functions of X type (as inferred by sizeof and castings) got allocated and freed (including following wrappers)

Amusingly, @melver just implemented this in clang for heap partitioning hardening.
@melver @a-nogikh can you point to relevant clang logic to borrow?

dvyukov · 2025-11-17T13:36:07Z

> tell me where accesses to this field are guarded by another field (refcnts, locks, etc)

What would be the heuristic to match locks with protected fields?
I am thinking of a case where a very large piece of code (the transitive closure of the call graph, maybe including some very cold paths like swapping) runs within a critical section. It would be wrong to assume that all fields ever accessed in all that code are protected by the mutex.

If we also give wrong answers (both false positives and false negatives), it may confuse the LLM.
Perhaps it's better to expose simpler and more reliable primitives: where things/fields are referenced, call graph info, which functions do locking/unlocking, maybe names of locks, etc. Ideally, the LLM should be able to combine these primitives to answer more complex questions.
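
A rough Go sketch of how such primitives might be combined. Everything here is hypothetical: the `Primitives` struct, its three maps, and the `RefsUnder` helper are illustrative names, not part of pkg/codesearch. The combined query returns leads to inspect rather than a verdict, which is in line with the concern about wrong answers.

```go
package main

import "fmt"

// Primitives holds three hypothetical low-level facts an index could
// expose. The names and shapes are assumptions made for this sketch.
type Primitives struct {
	Calls     map[string][]string // caller -> direct callees
	FieldRefs map[string][]string // "struct.field" -> functions referencing it
	Locks     map[string][]string // function -> locks it acquires directly
}

// Reachable returns root plus every function transitively called from it.
func (p *Primitives) Reachable(root string) map[string]bool {
	seen := map[string]bool{root: true}
	queue := []string{root}
	for len(queue) > 0 {
		fn := queue[0]
		queue = queue[1:]
		for _, callee := range p.Calls[fn] {
			if !seen[callee] {
				seen[callee] = true
				queue = append(queue, callee)
			}
		}
	}
	return seen
}

// RefsUnder lists functions that reference field and are reachable from
// some function that directly acquires lock. Reachability does not prove
// the access happens inside the critical section, so the result is only
// a set of candidates to read, not a claim that lock protects field.
func (p *Primitives) RefsUnder(field, lock string) []string {
	reach := map[string]bool{}
	for fn, locks := range p.Locks {
		for _, l := range locks {
			if l != lock {
				continue
			}
			for callee := range p.Reachable(fn) {
				reach[callee] = true
			}
		}
	}
	var out []string
	for _, fn := range p.FieldRefs[field] {
		if reach[fn] {
			out = append(out, fn)
		}
	}
	return out
}

func main() {
	p := &Primitives{
		Calls: map[string][]string{
			"foo_ioctl":  {"foo_update"},
			"foo_update": {"foo_touch"},
			"foo_irq":    {"foo_touch_fast"},
		},
		FieldRefs: map[string][]string{
			"foo.count": {"foo_touch", "foo_touch_fast"},
		},
		Locks: map[string][]string{
			"foo_ioctl": {"foo.mu"},
		},
	}
	// foo_touch_fast is only reachable from foo_irq, which takes no lock.
	fmt.Println(p.RefsUnder("foo.count", "foo.mu"))
}
```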

sirdarckcat · 2025-11-17T13:45:55Z

> If we also give wrong answers (both false positives and false negatives), it may confuse the LLM.

We would only give this tool to the LLM when it is already debugging a potential lock issue, for example.

That said, that feels like a problem for later (as long as we have enough information to answer these questions, whether we do it in a function or let the LLM do it manually is an implementation detail we can decide based on experimentation)

melver · 2025-11-17T14:50:38Z

> tell me where functions of X type (as inferred by sizeof and castings) got allocated and freed (including following wrappers)

> Amusingly, @melver just implemented this in clang for heap partitioning hardening. @melver @a-nogikh can you point to relevant clang logic to borrow?

If you use latest LLVM repo you can just use infer_alloc::inferPossibleType: https://github.com/llvm/llvm-project/blob/29e7b4f9a72576a2901407834b988ec37f931d28/clang/include/clang/AST/InferAlloc.h#L25

Edit: As-is, it finds sizeof expressions. It doesn't automatically figure out the casts, though; it expects the caller to keep track of the innermost cast and pass it in.

melver · 2025-11-17T14:56:02Z

> tell me where functions of X type (as inferred by sizeof and castings) got allocated and freed (including following wrappers)

> Amusingly, @melver just implemented this in clang for heap partitioning hardening. @melver @a-nogikh can you point to relevant clang logic to borrow?

More generally, beware of the issues of how kmalloc is designed (macro -> macro -> always_inline -> outline slab function). That is also apparent from the patch I built for the kernel heap partitioning prototype: https://lore.kernel.org/all/20250825154505.1558444-1-elver@google.com/

I suspect that if you build a special purpose kernel-only tool, that can be solved and the right call expression can be fed into infer_alloc::inferPossibleType.

dvyukov · 2025-11-19T17:28:25Z

ping @a-nogikh @tarasmadan @pimyn-girgis

pkg/codesearch/codesearch.go

dvyukov changed the title ~~dvyukov codesearch tool~~ add codesearch tool Nov 17, 2025

dvyukov requested review from a-nogikh, pimyn-girgis and tarasmadan November 17, 2025 10:38

dvyukov force-pushed the dvyukov-codesearch-tool branch from 19ad222 to 97e8e12 Compare November 17, 2025 10:48

sirdarckcat approved these changes Nov 17, 2025

dvyukov force-pushed the dvyukov-codesearch-tool branch from 97e8e12 to cd178f7 Compare November 17, 2025 13:22

dvyukov force-pushed the dvyukov-codesearch-tool branch from cd178f7 to 4de58a2 Compare November 17, 2025 13:37

a-nogikh previously approved these changes Nov 19, 2025

dvyukov added 2 commits November 20, 2025 10:26

pkg/clangtool: allow final verification of output

71af9de

Let tools verify that all source file names, line numbers, etc are valid/present. If there are any bogus entries, it's better to detect them early than to crash/error much later when the info is used.

dvyukov dismissed a-nogikh’s stale review via 66b8f6e November 20, 2025 09:31

dvyukov force-pushed the dvyukov-codesearch-tool branch from 4de58a2 to 66b8f6e Compare November 20, 2025 09:31

dvyukov requested a review from a-nogikh November 20, 2025 09:31

dvyukov enabled auto-merge November 20, 2025 09:38

a-nogikh approved these changes Nov 20, 2025

dvyukov added this pull request to the merge queue Nov 20, 2025

Merged via the queue into google:master with commit 280ea30 Nov 20, 2025
18 checks passed

dvyukov added the "AI patching" label (Feature requests and bugs related to AI-based kernel bug fix generation) Jan 15, 2026

5 participants
