Visuo-motor policy learning has advanced robotic manipulation, but mastering precise, contact-rich tasks remains challenging due to vision's limitations in reasoning about contacts.
To address this, several efforts have integrated tactile sensors into policy learning. However, many of these rely on optical tactile sensors that are either confined to recognition tasks or require complex dimensionality-reduction steps for policy learning. This work instead learns policies with magnetic skin sensors, which are natively low-dimensional, highly sensitive, and inexpensive to integrate on robotic platforms.
To do this effectively, we present ViSk, a simple framework that uses a transformer-based policy and treats skin sensor data as additional tokens alongside visual information. Evaluated across four complex real-world tasks (credit card swiping, plug insertion, USB insertion, and bookshelf retrieval), ViSk significantly outperforms vision-only and prior tactile-based policies. Further analysis shows that combining the tactile and visual modalities improves both policy performance and spatial generalization, yielding an average improvement of 27.5% across tasks.
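The core idea is to project the low-dimensional skin readings into the same embedding space as the visual features and feed the concatenated token sequence through a transformer. Below is a minimal PyTorch sketch of this pattern; the module names, feature dimensions, and mean-pooling action head are illustrative assumptions, not the exact ViSk implementation.

```python
import torch
import torch.nn as nn


class VisuoSkinPolicy(nn.Module):
    """Sketch: transformer policy with skin-sensor readings as extra tokens."""

    def __init__(self, vision_dim=512, skin_dim=15, d_model=256,
                 n_actions=7, n_layers=4, n_heads=8):
        super().__init__()
        # Project each modality into a shared token embedding space.
        self.vision_proj = nn.Linear(vision_dim, d_model)
        self.skin_proj = nn.Linear(skin_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, vision_feats, skin_feats):
        # vision_feats: (B, N_cams, vision_dim)  per-camera image features
        # skin_feats:   (B, N_skin, skin_dim)    raw low-dimensional skin readings
        tokens = torch.cat(
            [self.vision_proj(vision_feats), self.skin_proj(skin_feats)], dim=1)
        encoded = self.encoder(tokens)
        # Pool over all tokens and predict a single action vector.
        return self.action_head(encoded.mean(dim=1))


# Dummy example: 2 camera features and 2 skin-sensor readings per sample.
policy = VisuoSkinPolicy()
actions = policy(torch.randn(1, 2, 512), torch.randn(1, 2, 15))
print(actions.shape)  # torch.Size([1, 7])
```

Because the skin readings are already low-dimensional, a single linear projection suffices to tokenize them, avoiding the dimensionality-reduction pipelines that optical tactile sensors typically require.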
The following videos show learned ViSk policy rollouts executed on the robot at 1x speed.
For each task, we run 10 evaluations per seed across 3 seeds on held-out target object positions unseen during training.