Description
I was looking at how to add the git info to the index, something like:
**Source Repository:** [https://github.com/Le09/Tutorial-Codebase-Knowledge/](https://github.com/Le09/Tutorial-Codebase-Knowledge/)
**Commit Hash:** d66d97639092051cd7eb0df82a96bec5a5b6bec4
**Branch Name:** main
So that it could potentially be used as a reference doc.
Also, having git info could be extended to add links to functions, classes, etc.
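To make the idea concrete, here is a minimal sketch of how the header lines and a link to a function could be rendered once the repository URL, commit, and branch are known. The helper names `format_git_header` and `permalink` are hypothetical, as are the file path and line numbers in the usage note; the `/blob/<commit>/<path>#L<start>-L<end>` layout is the one GitHub uses for permalinks, so other hosts would need a different pattern.

```python
def format_git_header(repo_url, commit, branch):
    """Render the three index lines as markdown (hypothetical helper)."""
    return "\n".join([
        f"**Source Repository:** [{repo_url}]({repo_url})",
        f"**Commit Hash:** {commit}",
        f"**Branch Name:** {branch}",
    ])


def permalink(repo_url, commit, path, start_line=None, end_line=None):
    """Build a GitHub-style permalink to a file, optionally to a line range."""
    base = f"{repo_url.rstrip('/')}/blob/{commit}/{path.lstrip('/')}"
    if start_line is None:
        return base
    anchor = f"#L{start_line}"
    # Only add the range suffix when it spans more than one line.
    if end_line is not None and end_line != start_line:
        anchor += f"-L{end_line}"
    return base + anchor
```

For example, `permalink(repo_url, commit, "utils/crawl_github_files.py", 10, 20)` would point at lines 10 to 20 of that file at the pinned commit, so the link stays valid even after the branch moves on.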
However, there are 3 different cases:
- local git repository
- remote repository, via ssh
- remote https repository
Case 1 is (mostly) easy; the only issue is that there might be a host alias.
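A rough sketch of what the local case could look like, not necessarily what the commit mentioned below does. It assumes the remote is named `origin` and that "host alias" means an SSH config alias appearing in the remote URL; `run_git`, `to_https_url`, `local_git_info`, and the `host_aliases` mapping are all hypothetical names.

```python
import re
import subprocess


def run_git(repo_dir, *args):
    """Run a git command inside repo_dir and return its trimmed stdout."""
    result = subprocess.run(
        ["git", "-C", repo_dir, *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def to_https_url(remote_url, host_aliases=None):
    """Turn an ssh or scp-like remote into an https URL; leave others as-is.

    host_aliases maps SSH config aliases to real hosts,
    e.g. {"github-work": "github.com"} (assumed to be supplied by the user).
    """
    host_aliases = host_aliases or {}
    url = remote_url.strip().rstrip("/")
    if url.endswith(".git"):
        url = url[:-4]
    # ssh://user@host[:port]/owner/repo
    match = re.match(r"^ssh://(?:[^@/]+@)?([^/:]+)(?::\d+)?/(.+)$", url)
    if not match:
        # scp-like form: user@host:owner/repo
        match = re.match(r"^(?:[^@/]+@)?([^/:]+):(?!//)(.+)$", url)
    if not match:
        return url  # already https (or something this sketch does not handle)
    host = host_aliases.get(match.group(1), match.group(1))
    return f"https://{host}/{match.group(2)}"


def local_git_info(repo_dir, host_aliases=None):
    """Collect the fields needed for the index header from a local checkout."""
    return {
        "repository": to_https_url(
            run_git(repo_dir, "remote", "get-url", "origin"), host_aliases
        ),
        "commit": run_git(repo_dir, "rev-parse", "HEAD"),
        "branch": run_git(repo_dir, "rev-parse", "--abbrev-ref", "HEAD"),
    }
```

Without an alias mapping, a remote such as `git@github-work:Le09/Tutorial-Codebase-Knowledge.git` would produce a link to a host that does not exist, which is why the alias has to be resolved explicitly.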
Case 2 is problematic because the project is checked out in a temporary directory that is created within `crawl_github_files`.
Case 3 uses the API, so it may be less of a problem: there is less duplication of work.
Except for case 1, I think it's a flaw in the abstraction: `crawl_github_files` is an isolated function, but there may be more that you want to extract from git.
Why have this complexity at all, rather than always cloning the repository into `.cache`?
If it's to save on size, that can be done with a shallow clone (`--depth 1`), although `git describe --tags` wouldn't work in that case.
But that only matters for large repositories, and the time spent cloning is dwarfed by the time spent calling the LLM, whatever the size may be.
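As an illustration of the "always clone into `.cache`" alternative, here is a minimal sketch. The function names and the `owner__repo` directory layout are assumptions made for this example, not existing project code; `depth` is left optional so a full clone (with tags) stays the default.

```python
import re
import subprocess
from pathlib import Path


def cache_dir_for(repo_url, cache_root=".cache"):
    """Derive a stable cache directory from the last two URL components."""
    trimmed = repo_url.rstrip("/")
    if trimmed.endswith(".git"):
        trimmed = trimmed[:-4]
    parts = re.split(r"[/:]", trimmed)
    owner, name = parts[-2], parts[-1]
    return Path(cache_root) / f"{owner}__{name}"


def clone_command(repo_url, target, depth=None):
    """Build the git clone argument list; depth=1 gives a shallow clone."""
    cmd = ["git", "clone"]
    if depth:
        cmd += ["--depth", str(depth)]
    return cmd + [repo_url, str(target)]


def ensure_clone(repo_url, cache_root=".cache", depth=None):
    """Clone into the cache if it is not there yet, then return the path."""
    target = cache_dir_for(repo_url, cache_root)
    if not (target / ".git").exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(clone_command(repo_url, target, depth), check=True)
    return target
```

With something like this, all three cases would end up as a local checkout, so the same git queries could be reused regardless of how the repository was specified.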
I've made a small commit for the local case: 01b7c28
Do you have an opinion on this, so that I can turn it into a real PR?