@moxious

@moxious moxious commented May 23, 2025

Associated Issue: #2623 (comment)

tl;dr what is this? It's a small Python script, with instructions, that fetches YouTube transcripts and summarizes them into nice Markdown files. The intent here is to store public text associated with those videos. This is nice by itself, but it gets better when combined with open-telemetry/opentelemetry.io#6769: Kapa can be trained on these files, giving OTel a sustainable way to do Q&A on the website based on video content.

The core of this PR is the Python code, which isn't that big. Most of the line changes are the Markdown files that the Python code outputs.

@moxious
Author

moxious commented May 23, 2025

Current known limitations: this works by pulling a raw YouTube transcript and then summarizing and cleaning it up. So when the raw YouTube transcript is imperfect (which it often is with names), errors do happen: "Reese Lee" sometimes becomes "Ree Lee" and "Adriana Villela" becomes "Adriana Villa". Both the "nice cleaned up version" and the "very messy YouTube original" are included for comparison (and also so it's harder for OpenAI to fool me).
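For readers who have not opened the diff, here is a minimal sketch of the flow described above: raw transcript in, cleaned-up summary plus the untouched original out, both in one Markdown file. This is not the code from the PR. The function names, the segment shape (a list of dicts with a `text` key) and the `summarize` callable are all assumptions made for illustration; in the real script the summarization step presumably calls an LLM.

```python
from typing import Callable, Iterable


def join_segments(segments: Iterable[dict]) -> str:
    """Join raw caption segments into one whitespace-normalized string.

    Assumes each segment is a dict with a "text" key (hypothetical shape).
    """
    return " ".join(
        " ".join(seg["text"].split())  # collapse newlines and repeated spaces
        for seg in segments
        if seg["text"].strip()  # skip empty caption segments
    )


def render_markdown(
    title: str,
    segments: list[dict],
    summarize: Callable[[str], str],
) -> str:
    """Build a Markdown document with both the cleaned and the raw text.

    `summarize` stands in for whatever cleanup step is used (hypothetical);
    keeping the raw text next to it lets a reader spot errors such as
    misheard names introduced or left behind by the cleanup.
    """
    raw = join_segments(segments)
    cleaned = summarize(raw)
    return (
        f"# {title}\n\n"
        "## Summary\n\n"
        f"{cleaned}\n\n"
        "## Raw YouTube transcript\n\n"
        f"{raw}\n"
    )
```

Passing the cleanup step in as a callable keeps the sketch testable without network access or API keys, and it makes the raw-versus-cleaned comparison mentioned above easy to reproduce.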

@moxious
Author

moxious commented May 27, 2025

The spell checker action will ultimately be impossible to pass with raw YouTube transcripts; in many cases it also flags names (some correct, some incorrect) as unknown words. Will probably need some advice on what to do in this case since there's some tension between "capture what people said" and "make sure it's correct"

@dmathieu
Member

cspell could be made to ignore the transcripts folder.
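As a rough illustration of that suggestion: cspell configuration accepts an `ignorePaths` list of globs, so excluding the transcripts comes down to adding one entry. The sketch below works on a config already loaded as a dict. The `transcripts/**` glob and the helper name are assumptions, and the actual config file name and format in the repository are not shown in this thread.

```python
import json


def add_ignore_path(config: dict, glob: str) -> dict:
    """Return a copy of a cspell config dict with `glob` in ignorePaths.

    The input dict is left unmodified; the glob is added only once.
    """
    updated = dict(config)
    paths = list(updated.get("ignorePaths", []))
    if glob not in paths:
        paths.append(glob)
    updated["ignorePaths"] = paths
    return updated


if __name__ == "__main__":
    # Hypothetical starting config and folder glob, for illustration only.
    config = add_ignore_path({"version": "0.2", "language": "en"}, "transcripts/**")
    print(json.dumps(config, indent=2))
```

This sidesteps the spell checker for the raw transcripts only; it does not resolve the tension between capturing what people said and making sure it is correct.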

@danielgblanco
Contributor

danielgblanco commented May 27, 2025

As this is aimed at YouTube transcripts, and I see how it can be really useful for the content the End-User SIG publishes, would it make more sense if this PR is opened against https://github.com/open-telemetry/sig-end-user ?

cc @avillela @reese-lee

@svrnm
Member

svrnm commented Jun 2, 2025

> As this is aimed at YouTube transcripts, and I see how it can be really useful for the content the End-User SIG publishes, would it make more sense if this PR is opened against open-telemetry/sig-end-user ?
>
> cc @avillela @reese-lee

Not all recordings are from the End User SIG, right? I think we can start with community and later see if there are better places to have them.

@danielgblanco
Contributor

You're right. We do have YouTube videos that come from the Comms SIG. However, as those tend to refer to documentation, do we think this tool is equally useful there? My thinking in putting this in a repo other than community is that it'd make it easier to manage permissions for maintaining those scripts.

@trask
Member

trask commented Jun 3, 2025

hi @moxious, can you send this PR to https://github.com/open-telemetry/sig-end-user instead? thanks

@moxious
Author

moxious commented Jun 4, 2025

yep. Traveling now, but will do early next week.
