You don't need to know any of this to use the skill. But if you're wondering what Claude is actually doing when you paste a link:

  1. It downloads the video using a tool called yt-dlp (free, open source)
  2. It uses FFmpeg (also free) to split the video into individual frames — like a flipbook
  3. It grabs the transcript, either from existing captions (free) or, only when no captions exist, by running the audio through Whisper (this needs a free API key from Groq)
  4. It reads each extracted frame as an image, alongside the timestamped transcript
  5. It answers your question based on what it actually saw and heard
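
If you're curious what steps 1 to 3 might look like as commands, here is a rough sketch. It only builds the command lines and does not run anything. The function name `build_commands`, the `work` folder, the file names, and the one-frame-per-second rate are all made up for illustration; the skill's real invocations may differ. The Whisper fallback is left out because it depends on your Groq setup.

```python
import shlex
from pathlib import Path


def build_commands(url: str, workdir: str = "work", fps: float = 1.0) -> dict[str, list[str]]:
    """Build illustrative command lines for download, captions, and frames.

    Hypothetical helper: names, paths, and the fps default are assumptions,
    not the skill's actual settings.
    """
    out = Path(workdir)
    video = out / "video.mp4"
    return {
        # Step 1: download the video with yt-dlp
        "download": ["yt-dlp", "-f", "mp4", "-o", str(video), url],
        # Step 3 (captions path): fetch subtitles only, no video
        "captions": [
            "yt-dlp", "--skip-download",
            "--write-subs", "--write-auto-subs",
            "--sub-langs", "en",
            "-o", str(out / "captions"), url,
        ],
        # Step 2: split the video into numbered JPEG frames with FFmpeg
        "frames": [
            "ffmpeg", "-i", str(video),
            "-vf", f"fps={fps}",
            str(out / "frame_%04d.jpg"),
        ],
    }


if __name__ == "__main__":
    for step, cmd in build_commands("https://example.com/some-video").items():
        print(f"{step}: {shlex.join(cmd)}")
```

Each entry is a plain list of arguments, so you could hand one to `subprocess.run` if you wanted to try a step yourself.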

The frame budget adapts to video length: short videos (under 30 seconds) get nearly every frame, while longer videos get a sparser scan with a hard cap of 100 frames. You can always narrow the window with --start and --end for denser coverage of a specific section.
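
To see why narrowing the window helps, here is a small back-of-the-envelope sketch. It assumes the 100 frames are spread evenly across whatever window you pick, which may not be exactly how the skill spaces them; `frame_interval` is a made-up helper name.

```python
def frame_interval(window_seconds: float, max_frames: int = 100) -> float:
    """Seconds between frames if max_frames are spread evenly over the window.

    Assumption: even spacing. The 100-frame default mirrors the hard cap
    described above.
    """
    if window_seconds <= 0:
        raise ValueError("window must be positive")
    return window_seconds / max_frames


full = frame_interval(600)   # a whole 10-minute video
narrow = frame_interval(60)  # a 60-second window chosen with --start/--end

print(f"full video: one frame every {full:.1f}s")
print(f"narrowed:   one frame every {narrow:.1f}s")
```

Under that assumption, the full 10-minute video gets roughly one frame every 6 seconds, while the one-minute window gets one about every 0.6 seconds, ten times denser.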