You don't need to know any of this to use the skill. But if you're wondering what Claude is actually doing when you paste a link:
- It downloads the video using a tool called yt-dlp (free, open source)
- It uses FFmpeg (also free) to split the video into individual frames — like a flipbook
- It grabs the transcript, either from existing captions (free) or, when no captions exist, by running the audio through Whisper, a speech-to-text model hosted by Groq (this requires a free Groq API key)
- It reads every frame as an image while simultaneously reading the transcript with timestamps
- It answers your question based on what it actually saw and heard
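
The download, caption, and frame-splitting steps above could be sketched roughly like this. This is a minimal illustration, not the skill's actual code: the function names, the `work` directory layout, the output file names, and the choice of English captions are all assumptions. The yt-dlp and FFmpeg flags themselves are standard options of those tools. The Whisper fallback is left out.

```python
from pathlib import Path
import shlex


def download_cmd(url: str, work_dir: Path) -> list[str]:
    """Build a yt-dlp command that saves the video as work_dir/video.<ext>."""
    return ["yt-dlp", "-o", str(work_dir / "video.%(ext)s"), url]


def captions_cmd(url: str, work_dir: Path, lang: str = "en") -> list[str]:
    """Build a yt-dlp command that fetches captions only, without the video."""
    return [
        "yt-dlp",
        "--skip-download",     # captions only
        "--write-subs",        # uploaded captions, if any
        "--write-auto-subs",   # auto-generated captions, if any
        "--sub-langs", lang,   # assumption: English captions
        "-o", str(work_dir / "video.%(ext)s"),
        url,
    ]


def frames_cmd(video: Path, frames_dir: Path, fps: float) -> list[str]:
    """Build an FFmpeg command that writes `fps` frames per second as JPEGs."""
    return [
        "ffmpeg",
        "-i", str(video),
        "-vf", f"fps={fps}",                  # sample at a fixed rate
        str(frames_dir / "frame_%04d.jpg"),   # frame_0001.jpg, frame_0002.jpg, ...
    ]


if __name__ == "__main__":
    work = Path("work")  # hypothetical scratch directory
    url = "https://example.com/watch?v=VIDEO_ID"  # placeholder link
    commands = (
        download_cmd(url, work),
        captions_cmd(url, work),
        frames_cmd(work / "video.mp4", work / "frames", 1),
    )
    for cmd in commands:
        # Print instead of running, so the sketch has no side effects.
        print(shlex.join(cmd))
```

To actually run a command, pass the list to `subprocess.run(cmd, check=True)` with yt-dlp and FFmpeg installed.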
The frame budget adapts to the video's length: short videos (under 30 seconds) get nearly every frame, while longer videos get a sparser scan with a hard cap of 100 frames. For denser coverage of a specific section, narrow the window with --start and --end.