Recent advances in chain-of-thought (CoT) reasoning have improved complex video understanding, but existing methods often struggle to adapt to domain-specific skills (e.g., temporal grounding, event detection, spatial relations) across diverse video content. To address this, we propose Video-Skill-CoT, a framework that automatically constructs and leverages skill-aware CoT supervision for domain-adaptive video reasoning. First, we construct skill-based CoT annotations: we extract domain-relevant reasoning skills from training questions, cluster them into a shared skill taxonomy, and create detailed multi-step CoT rationales tailored to each video-question pair for training. Second, we introduce a skill-specific expert learning framework, in which each expert module specializes in a subset of reasoning skills and is trained with lightweight adapters using the collected CoT supervision. We demonstrate the effectiveness of the proposed approach on three video understanding benchmarks, where Video-Skill-CoT consistently outperforms strong baselines. We also provide in-depth analyses comparing different CoT annotation pipelines and the skills learned across multiple video domains.
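As a concrete illustration of the first stage, the snippet below gives a minimal sketch of clustering LLM-extracted skill phrases into a shared taxonomy. The skill phrases, the sentence encoder, the helper names, and the cluster count are illustrative assumptions, not the exact configuration used in the paper.

# Minimal sketch of the skill-taxonomy construction step. In the actual pipeline the
# skill phrases per question would come from an LLM; here they are placeholders.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical skill phrases extracted from training questions.
skill_phrases = [
    "temporal grounding of events",
    "ordering actions over time",
    "spatial proximity between objects",
    "relative object positions",
    "character emotion recognition",
    "narrative intent understanding",
]

# Embed the skill phrases and cluster them into a shared skill taxonomy.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
embeddings = encoder.encode(skill_phrases)

num_skill_clusters = 3  # assumed taxonomy size
kmeans = KMeans(n_clusters=num_skill_clusters, random_state=0, n_init=10)
cluster_ids = kmeans.fit_predict(embeddings)

# Group phrases by cluster; each cluster becomes a skill used to tag video-question pairs.
taxonomy = {c: [p for p, cid in zip(skill_phrases, cluster_ids) if cid == c]
            for c in range(num_skill_clusters)}
print(taxonomy)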
This t-SNE plot visualizes the relative embedding distances of input questions across various video datasets. Questions from the same dataset tend to form tight clusters, reflecting shared domains or required skills. For instance, models pretrained on general datasets like LLaVA-Video-178K (Zhang et al., 2024) often fall short in capturing the nuanced narrative understanding required in datasets like CinePile (Rawal et al., 2024), highlighting the importance of adaptation to unfamiliar domains or specialized tasks.
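A plot of this kind can be generated roughly as sketched below; the sample questions, encoder, and t-SNE settings are toy assumptions for illustration, not the data behind the actual figure.

# Minimal sketch of a question-embedding t-SNE plot colored by dataset.
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

# Hypothetical sample questions keyed by dataset.
questions = {
    "VSI-Bench": ["Which object is closest to the stove?",
                  "How many chairs are in the room?"],
    "CinePile":  ["Why does the character leave the party early?",
                  "What motivates the detective's final decision?"],
    "E.T.-Bench": ["When does the person start cutting vegetables?",
                   "Which event happens right after the door opens?"],
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
labels, texts = zip(*[(d, q) for d, qs in questions.items() for q in qs])
embeddings = encoder.encode(list(texts))

# Project to 2D; perplexity must stay below the (small) sample count in this toy example.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(embeddings)

for dataset in questions:
    idx = [i for i, l in enumerate(labels) if l == dataset]
    plt.scatter(coords[idx, 0], coords[idx, 1], label=dataset)
plt.legend()
plt.title("t-SNE of question embeddings by dataset")
plt.savefig("question_tsne.png")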
Video-Skill-CoT automatically generates and utilizes skill-aware chain-of-thought (CoT) supervision for domain-adaptive video reasoning. First, as shown in (a), it extracts domain-relevant reasoning skills from training questions and organizes them into a shared skill taxonomy through clustering. Then, in (b), it generates detailed, multi-step CoT rationales tailored to each video-question pair for use in training. Finally, as illustrated in (c), a skill-specific expert learning framework is introduced: each expert module focuses on a subset of reasoning skills and is trained using lightweight adapters guided by the generated CoT supervision.
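To make stage (c) concrete, the following is a minimal PyTorch sketch of skill-specific experts realized as lightweight low-rank adapters around a frozen base layer, routed by skill cluster. The module shapes, rank, and the single-linear-layer base are simplifying assumptions; the paper attaches adapters to a full video MLLM rather than one layer.

# Minimal sketch of skill-specific expert adapters with routing by skill cluster.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """LoRA-style low-rank residual adapter: x + B(A(x))."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as an identity mapping

    def forward(self, x):
        return x + self.up(self.down(x))

class SkillExpertLayer(nn.Module):
    """Wraps a frozen base layer with one adapter per skill cluster."""
    def __init__(self, base: nn.Module, dim: int, num_skills: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapters are trained
        self.experts = nn.ModuleList(LowRankAdapter(dim) for _ in range(num_skills))

    def forward(self, x, skill_id: int):
        # Route the input through the expert assigned to its skill cluster.
        return self.experts[skill_id](self.base(x))

# Toy usage: route features through the expert for skill cluster 1.
layer = SkillExpertLayer(nn.Linear(64, 64), dim=64, num_skills=3)
features = torch.randn(2, 64)
out = layer(features, skill_id=1)
print(out.shape)  # torch.Size([2, 64])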
We compare CoT annotations produced by the regular CoT pipeline (a) and our skill-based CoT pipeline (b). Given a question about which object is closest to the stove, the regular CoT (left) offers a linear, scene-based narration that lacks structure and includes irrelevant details ("Camera first focuses ... it then pans to the right ..."), often making it harder to extract the key spatial information. In contrast, our skill-based CoT starts by identifying the relevant skills (e.g., spatial proximity) and breaking the task into focused sub-questions, such as comparing the washer and the refrigerator.
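A skill-aware annotation prompt in this spirit could be assembled as sketched below; the wording, field names, and function name are hypothetical and not the paper's exact prompt.

# Minimal sketch of a skill-aware CoT annotation prompt for a video-question pair.
def build_skill_cot_prompt(question: str, skills: list[str]) -> str:
    skill_list = "\n".join(f"- {s}" for s in skills)
    return (
        f"Question about the video: {question}\n\n"
        f"Relevant reasoning skills for this question:\n{skill_list}\n\n"
        "For each skill, pose a focused sub-question about the video and answer it\n"
        "from the visual evidence. Then combine the sub-answers into a final answer,\n"
        "avoiding scene narration that is irrelevant to the question."
    )

prompt = build_skill_cot_prompt(
    "Which object is closest to the stove?",
    ["spatial proximity", "object identification"],
)
print(prompt)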
We compare Video-Skill-CoT to recent MLLM baselines on three video understanding benchmarks (E.T.-Bench, VSI-Bench, CinePile) spanning different domains and required skills. Our approach consistently outperforms all baselines, achieving improvements of +4.10, +5.70, and +1.59 over the fine-tuned version of LLaVA-Video on E.T.-Bench, VSI-Bench, and CinePile, respectively.
We compare the impact of two key components: (1) skill-based CoT reasoning and (2) skill-specific expert modules. The full Video-Skill-CoT with both components achieves the best performance. Removing the expert modules (2nd row), the skill-based CoT (3rd row), or both (last row) consistently degrades performance, showing their complementary roles.
Inference output comparison: (a) LLaVA-Video trained with regular CoT and (b) LLaVA-Video trained with our skill-based CoT. Video-Skill-CoT successfully generates temporally grounded and precise rationales that more effectively support accurate answer generation.
@article{lee2025video,
  title={Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning},
  author={Daeun Lee and Jaehong Yoon and Jaemin Cho and Mohit Bansal},
  journal={arXiv preprint arXiv:2506.03525},
  year={2025}
}