SKIL: Semantic Keypoint Imitation Learning for Generalizable Data-efficient Manipulation

1Tsinghua University, 2Shanghai AI Laboratory, 3Shanghai Qi Zhi Institute

Overview


We propose Semantic Keypoint Imitation Learning (SKIL), a framework that automatically obtains semantic keypoints with the help of vision foundation models and forms descriptors of these keypoints, enabling efficient imitation learning of complex robotic tasks with significantly lower sample complexity. In real-world experiments, SKIL doubles the performance of baseline methods on tasks such as picking up a cup or a mouse, while demonstrating exceptional robustness to variations in objects, environmental changes, and distractors. On long-horizon tasks such as hanging a towel on a rack, where previous methods fail completely, SKIL achieves a mean success rate of 70% with as few as 30 demonstrations. Furthermore, SKIL naturally supports cross-embodiment learning thanks to its semantic keypoint abstraction; our experiments show that even human videos bring considerable improvement to learning performance. These results demonstrate the success of SKIL in achieving data-efficient, generalizable robot learning.

SKIL Framework


Overview of our framework SKIL, including the Semantic Keypoints Description Module and the Policy Module. The first module computes descriptors for the semantic keypoints. A transformer encoder then fuses these descriptors into a single keypoint embedding. Conditioned on the fused embedding and the robot state, a diffusion action head outputs the final action sequence.
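
To make the policy module concrete, below is a minimal PyTorch sketch of this pipeline. It is an illustrative reconstruction, not the authors' code: all dimensions and module names (KeypointPolicy, desc_dim, etc.) are assumptions, and the diffusion action head is stubbed out as a plain MLP for brevity.

# Minimal PyTorch sketch of the policy module: a transformer encoder fuses
# per-keypoint descriptors into one embedding, which conditions an action head
# together with the robot state. Dimensions and names are assumptions; the
# real action head is a diffusion model, replaced here by an MLP for brevity.
import torch
import torch.nn as nn

class KeypointPolicy(nn.Module):
    def __init__(self, desc_dim=768, state_dim=7, embed_dim=256,
                 action_dim=7, horizon=16):
        super().__init__()
        self.proj = nn.Linear(desc_dim, embed_dim)              # project keypoint descriptors
        self.cls = nn.Parameter(torch.zeros(1, 1, embed_dim))   # learnable fusion token
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Sequential(                       # placeholder for the diffusion head
            nn.Linear(embed_dim + state_dim, 512), nn.ReLU(),
            nn.Linear(512, horizon * action_dim))
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, keypoint_desc, robot_state):
        # keypoint_desc: (B, K, desc_dim), robot_state: (B, state_dim)
        tokens = self.proj(keypoint_desc)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        fused = self.encoder(torch.cat([cls, tokens], dim=1))[:, 0]  # fused keypoint embedding
        cond = torch.cat([fused, robot_state], dim=-1)
        actions = self.action_head(cond)
        return actions.view(-1, self.horizon, self.action_dim)       # action sequence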

Process of generating reference features, given a single reference image for the specific task: (1) apply SAM and a vision foundation model to obtain the object mask and the dense feature map, respectively; (2) cluster the masked features with K-means to obtain the reference features.

One-time Reference Features Generation
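
As a rough illustration of this one-time step, the sketch below assumes the SAM mask and the dense feature map have already been computed (their loading code is omitted) and clusters the masked per-pixel features with K-means; the function name and array shapes are hypothetical.

# One-time reference feature generation, assuming a precomputed dense feature
# map and SAM object mask. The K-means cluster centers serve as the reference
# descriptors for the task.
import numpy as np
from sklearn.cluster import KMeans

def reference_features(feature_map, mask, num_keypoints=8):
    """feature_map: (H, W, C) dense features; mask: (H, W) boolean object mask."""
    masked_feats = feature_map[mask]                        # (N, C) features on the object
    km = KMeans(n_clusters=num_keypoints, n_init=10).fit(masked_feats)
    return km.cluster_centers_                              # (num_keypoints, C) reference descriptors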

Evaluation Videos

We use a Franka robot arm equipped with a Robotiq gripper to perform six real-world tasks: four short-horizon tasks followed by two long-horizon ones. Our experiments demonstrate SKIL's strong generalization ability and excellent data efficiency across diverse unseen objects and scenes.

Short-horizon Tasks

Grasp Handle of Cup

Seen object
Seen scene

Unseen object
Seen scene

Unseen object
Seen scene

Unseen object
Unseen scene

Grasp Wall of Cup

Seen object
Seen scene

Unseen object
Seen scene

Unseen object
Seen scene

Unseen object
Unseen scene

Pick Mouse

Seen object
Seen scene

Unseen object
Seen scene

Unseen object
Seen scene

Unseen object
Unseen scene

Fold Towel

Seen object
Seen scene

Unseen object
Seen scene

Unseen object
Seen scene

Unseen object
Unseen scene

Long-horizon Tasks

Hang Towel

Seen object
Seen scene

Unseen object
Seen scene

Unseen object
Seen scene

Unseen object
Unseen scene

Hang Cloth

Seen object
Seen scene

Unseen object
Seen scene

Unseen object
Seen scene

Unseen object
Unseen scene

Discussions

Comparison with Baselines

We benchmark SKIL against baselines including DP, DP3, RISE, and GenDP. Real-world results are measured by evaluation-phase success rates on unseen testing objects. SKIL outperforms the baseline methods by a large margin on both training and testing objects. More details can be found in the paper.

Visualization of Semantic Keypoints

To give an intuitive view of SKIL's keypoint-based representation, the following figure illustrates the trajectories of the matched semantic keypoints on several tasks. Green points mark the current keypoints, and the white flows show their past trajectories. Temporal consistency is maintained across objects within the same category.
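
A minimal matplotlib sketch of this style of visualization is given below; the trajectory array layout and function name are assumptions made for illustration.

# Plot keypoint trajectories on an image: white trails for past positions,
# green markers for the current keypoints. Shapes are illustrative.
import matplotlib.pyplot as plt

def plot_keypoint_trajectories(image, trajectories):
    """image: (H, W, 3) array; trajectories: (T, K, 2) pixel coordinates over time."""
    plt.imshow(image)
    for k in range(trajectories.shape[1]):
        xs, ys = trajectories[:, k, 0], trajectories[:, k, 1]
        plt.plot(xs, ys, color="white", linewidth=1)      # past flow
        plt.scatter(xs[-1], ys[-1], color="lime", s=30)   # current keypoint
    plt.axis("off")
    plt.show()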

Ablation on Vision Foundation Models

We conduct ablation studies on various vision foundation models, including DiFT, DINOv2, and RADIO, comparing success rates and inference latency for the different backbones (DINOv2, DiFT, and RADIO v2.5) on an NVIDIA A10 GPU. More details can be found in the paper.
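
As an example of how such latency numbers can be measured, the sketch below times DINOv2 (ViT-S/14, loaded via torch.hub) on random input; DiFT and RADIO would be timed analogously, and the input resolution, repetition count, and availability of a CUDA device are assumptions.

# Rough GPU latency measurement for a DINOv2 backbone; warm-up iterations and
# torch.cuda.synchronize() keep the timing honest.
import time
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval().cuda()
x = torch.randn(1, 3, 224, 224).cuda()

with torch.no_grad():
    for _ in range(10):                  # warm-up
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()
    print(f"mean latency: {(time.time() - start) / 100 * 1000:.1f} ms")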

Citation

If you find our work helpful, please cite us:
@article{wang2025skil,
    title={SKIL: Semantic Keypoint Imitation Learning for Generalizable Data-efficient Manipulation},
    author={Wang, Shengjie and You, Jiacheng and Hu, Yihang and Li, Jiongye and Gao, Yang},
    journal={arXiv preprint arXiv:2501.14400},
    year={2025}
}
Thank you!