
@AlessandroAnnini

Hello @gcapuzzi,
this is a second attempt after PR #2

I changed how the repo is cloned: it no longer goes through the GitHub API and no longer needs a GitHub PAT. It is now a plain `git clone`, run as a bash command in a subprocess.
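A minimal sketch of what the clone step could look like, not the notebook's actual code. The helper names `build_clone_command` and `clone_repo` are hypothetical, the shallow `--depth 1` clone is an assumption, and the command is passed as an argument list rather than a shell string so that no quoting is needed:

```python
import subprocess
from pathlib import Path


def build_clone_command(repo_url: str, dest: Path, depth: int = 1) -> list[str]:
    """Return the argv for a shallow `git clone` of repo_url into dest."""
    # --depth keeps the download small; drop it if full history is needed.
    return ["git", "clone", "--depth", str(depth), repo_url, str(dest)]


def clone_repo(repo_url: str, dest: Path) -> Path:
    """Clone a public repository without a token.

    Raises subprocess.CalledProcessError if git exits with a non-zero status.
    """
    subprocess.run(build_clone_command(repo_url, dest), check=True)
    return dest
```

Because no token is involved, this only works as-is for public repositories, or for private ones where git credentials are already configured on the machine.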

The files are filtered by a list of allowed extensions called `ext_whitelist`, so you can grab md, mdx, and other file types in a single pass.
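The filter could be sketched roughly as below. The function name `filter_by_extension` is hypothetical, and I am assuming `ext_whitelist` holds dotted suffixes such as `".md"`; the notebook may store them differently:

```python
from pathlib import Path


def filter_by_extension(root: Path, ext_whitelist: list[str]) -> list[Path]:
    """Return every file under root whose suffix is in ext_whitelist."""
    # Compare case-insensitively so README.MD and readme.md are treated alike.
    allowed = {ext.lower() for ext in ext_whitelist}
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file()
        and p.suffix.lower() in allowed
        # Skip git's own metadata directory inside the clone.
        and ".git" not in p.relative_to(root).parts
    )
```

For example, `filter_by_extension(Path("repo"), [".md", ".mdx"])` would return only the markdown and MDX files of a clone located in `repo`.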

The files that pass the filter are split using a markdown-specific splitter that creates metadata from the titles in the file itself. This is good for the quality, and I think the speed, of searches in the vector DB. The problem is that it only works with markdown, so if your `ext_whitelist` includes .js files, those should really use a different splitter. The splitter should be selected dynamically based on the file extension, with more splitters added as needed.
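One way the extension-based selection could work is a small registry that maps suffixes to splitter functions, with a generic fallback. This is only a sketch: `split_markdown`, `split_plain`, `SPLITTERS`, and `split_file` are hypothetical names, and the two splitters are simplified stdlib stand-ins. In the notebook they would presumably be replaced by real LangChain splitters (for instance `MarkdownHeaderTextSplitter` for markdown and `RecursiveCharacterTextSplitter.from_language` for code), which is an assumption on my part:

```python
from pathlib import Path
from typing import Callable, Optional

Chunk = dict  # shape: {"text": str, "metadata": dict}
Splitter = Callable[[str], list[Chunk]]


def split_markdown(text: str) -> list[Chunk]:
    """Stand-in markdown splitter: one chunk per heading, title kept as metadata."""
    chunks: list[Chunk] = []
    title: Optional[str] = None
    lines: list[str] = []

    def flush() -> None:
        # Emit the section collected so far, ignoring empty sections.
        if any(line.strip() for line in lines):
            chunks.append({"text": "\n".join(lines), "metadata": {"title": title}})

    for line in text.splitlines():
        if line.startswith("#"):
            flush()
            title, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


def split_plain(text: str, chunk_size: int = 1000) -> list[Chunk]:
    """Fallback splitter: fixed-size character chunks with no metadata."""
    return [
        {"text": text[i : i + chunk_size], "metadata": {}}
        for i in range(0, len(text), chunk_size)
    ]


# Registry keyed by lowercase suffix; add entries as new file types appear.
SPLITTERS: dict[str, Splitter] = {
    ".md": split_markdown,
    ".mdx": split_markdown,
}


def split_file(path: Path, text: str) -> list[Chunk]:
    """Pick a splitter from the file extension, falling back to split_plain."""
    splitter = SPLITTERS.get(path.suffix.lower(), split_plain)
    return splitter(text)
```

Supporting .js files would then come down to registering one more entry, e.g. `SPLITTERS[".js"] = some_js_splitter`, without touching the rest of the pipeline.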

I took a stab at using LangChain LCEL, but I couldn't implement memory the way I wanted, so that code is commented out.

The rest is just minor refinement, nothing important.

P.S. Keeping the code in a notebook is pretty inconvenient: in this PR you can't really see what I changed unless you go to my project and try it, and the file has no usable change history either.

Signed-off-by: Alessandro Annini <alessandro.annini@gmail.com>
gcapuzzi added a commit that referenced this pull request Jan 9, 2026