We propose an asymmetric multi-resolution transformer architecture (denoted MResT) that combines the generalization capability of large vision-language models with the robustness of multiple local sensing modalities (wrist-mounted views, vibro-tactile sensors) to perform a range of manipulation tasks, in particular more precise and dynamic manipulation tasks.
[Teaser figure: our approach senses at multiple spatial resolutions and multiple temporal resolutions.]
To enable a general multi-task, multi-resolution approach, we would like to use pretrained vision-language models, given their impressive zero-shot generalization capabilities for image-text grounding. However, large vision-language models can have billions of parameters and are thus hard to scale to real-time control. Further, in practical settings with limited robot training data, they are hard to adapt to the downstream task.
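As a concrete illustration of this design choice, below is a minimal sketch (not the authors' released code) of extracting frozen image-text features from a pretrained CLIP model via Hugging Face transformers; the specific VLM backbone and the way image and text embeddings are fused are assumptions for illustration.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Freeze the VLM: we only read out its representations, we never update it.
model.eval()
for p in model.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def vlm_features(image: Image.Image, instruction: str) -> torch.Tensor:
    """Return a joint image-text embedding used as the coarse, low-rate input."""
    inputs = processor(text=[instruction], images=image,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    # Concatenate pooled image and text embeddings as a simple fused feature.
    return torch.cat([out.image_embeds, out.text_embeds], dim=-1)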
The figure below shows our overall multi-resolution architecture. Our work builds on three key insights to learn generalizable multi-task policies that scale to coarse, precise, and dynamic manipulation tasks.
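To make the asymmetry concrete, here is a minimal PyTorch sketch of how a frozen VLM embedding (slow, coarse) can be fused with small, fast encoders for local sensing (wrist camera, tactile) in a lightweight transformer that predicts actions. All module names, dimensions, and the token-level fusion scheme are illustrative assumptions, not the released MResT implementation.

import torch
import torch.nn as nn

class MultiResolutionPolicy(nn.Module):
    def __init__(self, vlm_dim=1024, tactile_dim=64, d_model=128, action_dim=7):
        super().__init__()
        self.vlm_proj = nn.Linear(vlm_dim, d_model)        # coarse, low-rate token
        self.wrist_enc = nn.Sequential(                    # fast wrist-camera encoder
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d_model))
        self.tactile_enc = nn.Linear(tactile_dim, d_model)  # fast tactile encoder
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, vlm_feat, wrist_img, tactile):
        tokens = torch.stack([
            self.vlm_proj(vlm_feat),
            self.wrist_enc(wrist_img),
            self.tactile_enc(tactile),
        ], dim=1)                                  # (batch, 3 tokens, d_model)
        fused = self.fusion(tokens).mean(dim=1)
        return self.action_head(fused)             # e.g. a delta end-effector action

Only the small fusion transformer and local encoders are trained; the frozen VLM supplies its token once per slow update and the fast modalities refresh at control rate.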
We use three different manipulation domains and real-world experiments to evaluate our proposed approach.
The figure above (Left) compares our multi-resolution approach to common single-resolution multi-task baselines
such as BC-Z and RT-1. While single-resolution approaches do well on coarse manipulation tasks (MT-Coarse), they
perform poorly on precise and dynamic tasks.
Our second result (Right) shows the importance of using a multi-temporal-resolution approach.
For this, we compare against baselines that use multiple spatial resolutions at a fixed temporal resolution, i.e., low frequency (5 Hz)
or high frequency (20 Hz). While fixed-resolution approaches work well for
quasi-static tasks (MT-Coarse and MT-Precise), they fail on our dynamic task, where fast reaction to contact is necessary
for task success.
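The multi-rate structure can be sketched as a simple control loop, using the frequencies stated above (coarse VLM features at 5 Hz, local policy at 20 Hz). The robot and sensor interfaces (get_third_person_image, get_wrist_image, get_tactile, send_action, task_done) are hypothetical placeholders, not an actual API; policy and vlm_features refer to the sketches above.

import time

CONTROL_HZ = 20                        # fast loop: local sensing + action prediction
VLM_HZ = 5                             # slow loop: refresh coarse vision-language features
STEPS_PER_VLM_UPDATE = CONTROL_HZ // VLM_HZ

vlm_feat = vlm_features(get_third_person_image(), instruction)
step = 0
while not task_done():
    t0 = time.time()
    if step % STEPS_PER_VLM_UPDATE == 0:
        # Coarse features are recomputed only every few control steps.
        vlm_feat = vlm_features(get_third_person_image(), instruction)
    action = policy(vlm_feat, get_wrist_image(), get_tactile())
    send_action(action)
    step += 1
    time.sleep(max(0.0, 1.0 / CONTROL_HZ - (time.time() - t0)))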
The figure on the right shows the generalization abilities of our approach. We compare our multi-resolution approach (with a frozen VLM) against a finetuned VLM. We report the success rate on novel concepts as well as on training concepts (dotted blue). While there is a performance drop on novel concepts (~5-10%), across all concept variations (i.e., color, geometry) frozen representations outperform finetuned representations when evaluated on novel concepts.
@inproceedings{saxena2023multiresolution,
  title={Multi-Resolution Sensing for Real-Time Control with Vision-Language Models},
  author={Saumya Saxena and Mohit Sharma and Oliver Kroemer},
  booktitle={7th Annual Conference on Robot Learning},
  year={2023},
  url={https://openreview.net/forum?id=WuBv9-IGDUA}
}