Description
Steps 3 and 4 could be simplified using tree-sitter queries. Instead of writing out the S-expressions of the parse trees and matching them with regexes, we can use tree-sitter's built-in query support, which captures nodes that you describe in terms of the tree-sitter grammar.
For example, to capture only top-level function_definition and class_definition nodes:
(module
  [(function_definition)
   (class_definition)] @top-level)
We could iterate over all @top-level matches and extract either the start/end points or, if you just want to pull the content directly from tree-sitter, the start/end bytes. I think this is another case where the description of what we want is really a very small program, just written as a tree-sitter query. For Python, LLMs already know this grammar; the query above was generated by Claude.
The current regex is tied to the output format of the CLI.
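To make concrete what iterating over the @top-level matches should yield, here is a rough sketch of the same selection. It deliberately uses Python's standard-library `ast` module as a stand-in rather than tree-sitter, only so that it runs without extra dependencies. The helper name `top_level_definitions` and the `SAMPLE` source are made up for illustration, and `ast` reports line/column positions, not the byte offsets tree-sitter would give.

```python
import ast

def top_level_definitions(source: str):
    """Return (name, start_line, end_line, text) for each top-level def/class.

    Stand-in for iterating over the @top-level captures: only direct
    children of the module are considered, so nested definitions are skipped.
    """
    tree = ast.parse(source)
    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    results = []
    for node in tree.body:  # direct children of the module only
        if isinstance(node, wanted):
            # Slice the original text using the node's start/end positions.
            text = ast.get_source_segment(source, node)
            results.append((node.name, node.lineno, node.end_lineno, text))
    return results

# Hypothetical input, used only to exercise the helper.
SAMPLE = '''\
import os

def foo():
    def inner():
        pass
    return inner

class Bar:
    def method(self):
        pass
'''

for name, start, end, _text in top_level_definitions(SAMPLE):
    print(f"{name}: lines {start}-{end}")
```

The nested `inner` and `method` definitions are not reported, which is the behaviour the query above asks for by anchoring the alternation under `module`. With tree-sitter itself, the equivalent would presumably read each captured node's start/end points or bytes instead.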