Recent advances in chain-of-thought (CoT) reasoning have improved complex video understanding, but existing methods often struggle to adapt to domain-specific skills (e.g., temporal grounding, event detection, spatial relations) across diverse video content. To address this, we propose Video-Skill-CoT, a framework that automatically constructs and leverages skill-aware CoT supervision for domain-adaptive video reasoning. First, we construct skill-based CoT annotations: we extract domain-relevant reasoning skills from training questions, cluster them into a shared skill taxonomy, and create detailed multi-step CoT rationales tailored to each video-question pair for training. Second, we introduce a skill-specific expert learning framework, in which each expert module specializes in a subset of reasoning skills and is trained with lightweight adapters using the collected CoT supervision. We demonstrate the effectiveness of the proposed approach on three video understanding benchmarks, where Video-Skill-CoT consistently outperforms strong baselines. We also provide in-depth analyses comparing different CoT annotation pipelines and the skills learned across multiple video domains.
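As a concrete illustration of the first stage, the snippet below gives a minimal sketch of clustering LLM-extracted skill phrases into a shared taxonomy. The skill phrases, the sentence encoder, the helper names, and the cluster count are illustrative assumptions, not the exact configuration used in the paper.

# Minimal sketch of the skill-taxonomy construction step. In the actual pipeline the
# skill phrases per question would come from an LLM; here they are placeholders.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical skill phrases extracted from training questions.
skill_phrases = [
    "temporal grounding of events",
    "ordering actions over time",
    "spatial proximity between objects",
    "relative object positions",
    "character emotion recognition",
    "narrative intent understanding",
]

# Embed the skill phrases and cluster them into a shared skill taxonomy.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
embeddings = encoder.encode(skill_phrases)

num_skill_clusters = 3  # assumed taxonomy size
kmeans = KMeans(n_clusters=num_skill_clusters, random_state=0, n_init=10)
cluster_ids = kmeans.fit_predict(embeddings)

# Group phrases by cluster; each cluster becomes a skill used to tag video-question pairs.
taxonomy = {c: [p for p, cid in zip(skill_phrases, cluster_ids) if cid == c]
            for c in range(num_skill_clusters)}
print(taxonomy)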
This t-SNE plot visualizes the relative embedding distances of input questions across various video datasets. Questions from the same dataset tend to form tight clusters, reflecting shared domains or required skills. For instance, models pretrained on general datasets like LLaVA-Video-178K (Zhang et al., 2024) often fall short in capturing the nuanced narrative understanding required in datasets like CinePile (Rawal et al., 2024), highlighting the importance of adaptation to unfamiliar domains or specialized tasks.
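A plot of this kind can be generated roughly as sketched below; the sample questions, encoder, and t-SNE settings are toy assumptions for illustration, not the data behind the actual figure.

# Minimal sketch of a question-embedding t-SNE plot colored by dataset.
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

# Hypothetical sample questions keyed by dataset.
questions = {
    "VSI-Bench": ["Which object is closest to the stove?",
                  "How many chairs are in the room?"],
    "CinePile":  ["Why does the character leave the party early?",
                  "What motivates the detective's final decision?"],
    "E.T.-Bench": ["When does the person start cutting vegetables?",
                   "Which event happens right after the door opens?"],
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
labels, texts = zip(*[(d, q) for d, qs in questions.items() for q in qs])
embeddings = encoder.encode(list(texts))

# Project to 2D; perplexity must stay below the (small) sample count in this toy example.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(embeddings)

for dataset in questions:
    idx = [i for i, l in enumerate(labels) if l == dataset]
    plt.scatter(coords[idx, 0], coords[idx, 1], label=dataset)
plt.legend()
plt.title("t-SNE of question embeddings by dataset")
plt.savefig("question_tsne.png")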
Video-Skill-CoT automatically generates and utilizes skill-aware chain-of-thought (CoT) supervision for domain-adaptive video reasoning. First, as shown in (a), it extracts domain-relevant reasoning skills from training questions and organizes them into a shared skill taxonomy through clustering. Then, in (b), it generates detailed, multi-step CoT rationales tailored to each video-question pair for use in training. Finally, as illustrated in (c), a skill-specific expert learning framework is introduced: each expert module focuses on a subset of reasoning skills and is trained using lightweight adapters guided by the generated CoT supervision.
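To make stage (c) concrete, the following is a minimal PyTorch sketch of skill-specific experts realized as lightweight low-rank adapters around a frozen base layer, routed by skill cluster. The module shapes, rank, and the single-linear-layer base are simplifying assumptions; the paper attaches adapters to a full video MLLM rather than one layer.

# Minimal sketch of skill-specific expert adapters with routing by skill cluster.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """LoRA-style low-rank residual adapter: x + B(A(x))."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as an identity mapping

    def forward(self, x):
        return x + self.up(self.down(x))

class SkillExpertLayer(nn.Module):
    """Wraps a frozen base layer with one adapter per skill cluster."""
    def __init__(self, base: nn.Module, dim: int, num_skills: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapters are trained
        self.experts = nn.ModuleList(LowRankAdapter(dim) for _ in range(num_skills))

    def forward(self, x, skill_id: int):
        # Route the input through the expert assigned to its skill cluster.
        return self.experts[skill_id](self.base(x))

# Toy usage: route features through the expert for skill cluster 1.
layer = SkillExpertLayer(nn.Linear(64, 64), dim=64, num_skills=3)
features = torch.randn(2, 64)
out = layer(features, skill_id=1)
print(out.shape)  # torch.Size([2, 64])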
We compare CoT annotations produced by the regular CoT pipeline (a) and our skill-based CoT pipeline (b). Given a question about which object is closest to the stove, the regular CoT (left) offers a linear, scene-based narration that lacks structure and includes irrelevant details ("Camera first focuses ... it then pans to the right ..."), often making it harder to extract the key spatial information. In contrast, our skill-based CoT starts by identifying the relevant skills (e.g., spatial proximity) and breaking the task into focused sub-questions, such as comparing the washer and the refrigerator.
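A skill-aware annotation prompt in this spirit could be assembled as sketched below; the wording, field names, and function name are hypothetical and not the paper's exact prompt.

# Minimal sketch of a skill-aware CoT annotation prompt for a video-question pair.
def build_skill_cot_prompt(question: str, skills: list[str]) -> str:
    skill_list = "\n".join(f"- {s}" for s in skills)
    return (
        f"Question about the video: {question}\n\n"
        f"Relevant reasoning skills for this question:\n{skill_list}\n\n"
        "For each skill, pose a focused sub-question about the video and answer it\n"
        "from the visual evidence. Then combine the sub-answers into a final answer,\n"
        "avoiding scene narration that is irrelevant to the question."
    )

prompt = build_skill_cot_prompt(
    "Which object is closest to the stove?",
    ["spatial proximity", "object identification"],
)
print(prompt)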
We compare Video-Skill-CoT to recent MLLM baselines on three video understanding benchmarks (E.T.-Bench, VSI-Bench, CinePile) spanning different domains and required skills. Our approach consistently outperforms all baselines, achieving improvements of +4.10, +5.70, and +1.59 over the fine-tuned version of LLaVA-Video on E.T.-Bench, VSI-Bench, and CinePile, respectively.
We compare the impact of two key components: (1) skill-based CoT reasoning and (2) skill-specific expert modules. The full Video-Skill-CoT with both components achieves the best performance. Removing the expert modules (2nd row), the skill-based CoT (3rd row), or both (last row) consistently degrades performance, showing their complementary roles.
Inference output comparison: (a) LLaVA-Video trained with regular CoT and (b) LLaVA-Video trained with our skill-based CoT. Video-Skill-CoT successfully generates temporally grounded and precise rationales that more effectively support accurate answer generation.
@article{lee2025video,
  title={Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning},
  author={Daeun Lee and Jaehong Yoon and Jaemin Cho and Mohit Bansal},
  journal={arXiv preprint arXiv:2506.03525},
  year={2025}
}