We propose Semantic Keypoint Imitation Learning (SKIL), a framework that automatically obtains semantic keypoints with the help of vision foundation models and forms descriptors of these keypoints, enabling efficient imitation learning of complex robotic tasks with significantly lower sample complexity. In real-world experiments, SKIL doubles the performance of baseline methods on tasks such as picking up a cup or a mouse, while demonstrating exceptional robustness to variations in objects, environmental changes, and distractors. On long-horizon tasks such as hanging a towel on a rack, where previous methods fail completely, SKIL achieves a mean success rate of 70% with as few as 30 demonstrations. Furthermore, SKIL naturally supports cross-embodiment learning thanks to its semantic keypoint abstraction; our experiments show that even human videos bring considerable improvements to learning performance. These results demonstrate the success of SKIL in achieving data-efficient, generalizable robotic learning.
Overview of our framework SKIL, which consists of a Semantic Keypoint Description Module and a Policy Module. The first module computes descriptors for the semantic keypoints. A transformer encoder then fuses these descriptors into a single keypoint embedding. Conditioned on the fused embedding and the robot state, a diffusion action head outputs the final action sequence.
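To make the pipeline concrete, below is a minimal PyTorch sketch of the Policy Module described above. The class and parameter names (e.g., KeypointPolicy, desc_dim, horizon) are our own illustrative choices, and the simple noise-prediction MLP stands in for the full diffusion action head; this is a sketch, not the released implementation.

```python
# Sketch of the Policy Module (illustrative, not the authors' code):
# keypoint descriptors are fused by a transformer encoder, and a diffusion
# action head (here a simple noise-prediction MLP) is conditioned on the
# fused embedding and the robot state.
import torch
import torch.nn as nn

class KeypointPolicy(nn.Module):
    def __init__(self, desc_dim=768, state_dim=7, act_dim=7, horizon=16, d_model=256):
        super().__init__()
        self.proj = nn.Linear(desc_dim, d_model)          # project keypoint descriptors
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.horizon, self.act_dim = horizon, act_dim
        # Noise-prediction network of the diffusion head:
        # input = noisy action sequence + fused embedding + robot state + timestep
        self.noise_pred = nn.Sequential(
            nn.Linear(horizon * act_dim + d_model + state_dim + 1, 512),
            nn.ReLU(),
            nn.Linear(512, horizon * act_dim),
        )

    def forward(self, keypoint_desc, robot_state, noisy_actions, t):
        # keypoint_desc: (B, K, desc_dim), robot_state: (B, state_dim)
        # noisy_actions: (B, horizon, act_dim), t: (B, 1) diffusion timestep
        tokens = self.encoder(self.proj(keypoint_desc))    # (B, K, d_model)
        fused = tokens.mean(dim=1)                         # fused keypoint embedding
        x = torch.cat([noisy_actions.flatten(1), fused, robot_state, t], dim=-1)
        return self.noise_pred(x).view(-1, self.horizon, self.act_dim)
```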
Process of generating reference features from a single reference image of the task: (1) apply SAM and a vision foundation model to obtain the object mask and the dense feature map, respectively; (2) cluster the masked features with K-means to obtain the reference features.
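The clustering step can be sketched as follows, assuming the SAM mask and the foundation-model feature map have already been computed and resized to a common resolution; the function name and arguments are hypothetical, not the authors' code.

```python
# Minimal sketch of the reference-feature step: cluster the per-pixel features
# inside the object mask with K-means and use the cluster centers as the task's
# reference features. (Assumed inputs; not the release code.)
import numpy as np
from sklearn.cluster import KMeans

def reference_features(feature_map: np.ndarray, mask: np.ndarray, k: int = 8) -> np.ndarray:
    """feature_map: (H, W, C) dense features; mask: (H, W) boolean object mask."""
    masked_feats = feature_map[mask]            # (N, C) features inside the object mask
    kmeans = KMeans(n_clusters=k, n_init=10).fit(masked_feats)
    return kmeans.cluster_centers_              # (k, C) reference descriptors
```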
We use a Franka robot arm equipped with a Robotiq gripper to perform six real-world tasks: four short-horizon tasks and two long-horizon ones. Our experiments demonstrate the strong generalization ability and excellent data efficiency of SKIL across diverse unseen objects and scenes.
For each task, rollout videos are shown under four settings: a seen object in a seen scene, two unseen objects in a seen scene, and an unseen object in an unseen scene.
We benchmark SKIL against baselines including DP, DP3, RISE, and GenDP. Real-world results are measured by success rates on unseen testing objects during the evaluation phase. SKIL outperforms the baseline methods by a large margin on both training and testing objects. More details can be found in the paper.
To give an intuitive view of SKIL's keypoint-based representation, the following figure illustrates the trajectories of the matched semantic keypoints on several tasks. Green points mark the current keypoints, and the white flows trace their past trajectories. Temporal consistency is maintained across objects within the same category.
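Such an overlay is straightforward to reproduce; a small OpenCV sketch that draws the current keypoints in green with white history trails is given below (our own illustration, not the release code).

```python
# Draw current keypoints in green and their past positions as white trails.
# (Illustrative visualization only; assumed data layout.)
import cv2
import numpy as np

def draw_keypoint_trails(frame: np.ndarray, history: list[np.ndarray]) -> np.ndarray:
    """history: list of (K, 2) pixel coordinates, oldest first, last entry = current frame."""
    vis = frame.copy()
    if not history:
        return vis
    for prev, curr in zip(history[:-1], history[1:]):
        for (x0, y0), (x1, y1) in zip(prev, curr):
            cv2.line(vis, (int(x0), int(y0)), (int(x1), int(y1)), (255, 255, 255), 1)  # white flow
    for x, y in history[-1]:
        cv2.circle(vis, (int(x), int(y)), 4, (0, 255, 0), -1)                          # green current keypoint
    return vis
```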
We conduct ablation studies on various vision foundation models, including DIFT, DINOv2, and RADIO; more details can be found in the paper. The figure below compares success rates and inference latency of the different vision foundation models (DINOv2, DIFT, and RADIO v2.5) on an NVIDIA A10 GPU.
@article{wang2025skil,
title={SKIL: Semantic Keypoint Imitation Learning for Generalizable Data-efficient Manipulation},
author={Wang, Shengjie and You, Jiacheng and Hu, Yihang and Li, Jiongye and Gao, Yang},
journal={arXiv preprint arXiv:2501.14400},
year={2025}
}