This paper primarily focuses on learning robust visual-force policies in the context of high-precision object assembly tasks. Specifically, we focus on the contact phase of the assembly task where both objects (peg and hole) have made contact and the objective lies in maneuvering the objects to complete the assembly. Moreover, we aim to learn contact-rich manipulation policies with multisensory inputs on limited expert data by expanding human demonstrations via online data augmentation.
We develop a simulation environment with a dual-arm robot manipulator to evaluate the effect of augmented expert demonstration data. Our focus is on evaluating the robustness of our model with respect to certain task variations: grasp pose, peg/hole shape, object body shape, scene appearance, camera pose, and force-torque/proprioception noise. We show that our proposed data augmentation method helps in learning a multisensory manipulation policy that is robust to unseen instances of these variations, particularly physical variations such as grasp pose. Additionally, our ablative studies show the significant contribution of force-torque data to the robustness of our model.
For additional qualitative results and ablation studies, including experiments carried out in our real-world environment, refer to the supplementary material on this webpage!
Our behavior cloning framework implementation is based on Robomimic. To encode our observations, we draw upon the success of visuotactile transformer encoders and utilize a similar attention-based mechanism for RGB and tactile modality fusion. Rather than performing self-attention directly over the input tokens, we found that introducing a cross-attention step similar to the PerceiverIO architecture worked best for our task. We tokenize our inputs by computing linear projections of visual patches (as in vision transformers) for the RGB inputs and of individual per-timestep readings for the force-torque input, and then add modality-specific position encodings. We then cross-attend these input tokens with a set of 8 learned latent vectors, which pass through a series of self-attention layers before ultimately being compressed and projected (as in VTT) to an output latent embedding. We encode proprioception with a multilayer perceptron to get an output embedding and concatenate both output embeddings to form the input to the policy network. The policy network is a multilayer perceptron that outputs 3-dimensional delta actions.
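Below is a minimal PyTorch sketch of the encoder and policy structure described above. Only the quantities stated in the text (two wristview RGB streams with 36 patches each, 32 force-torque tokens, 8 learned latents, 3-dimensional delta actions) come from our description; all other dimensions, module choices, and names are illustrative assumptions rather than our exact implementation.

```python
# Minimal sketch of the visuotactile encoder and policy head described above.
# Dimensions not stated in the text (embed_dim, patch size, MLP widths,
# proprioception dimension) are illustrative placeholders.
import torch
import torch.nn as nn


class VisuotactileEncoder(nn.Module):
    def __init__(self, embed_dim=128, num_latents=8, num_heads=4, num_layers=2,
                 patch_dim=3 * 16 * 16, ft_dim=6, num_rgb_tokens=72, num_ft_tokens=32):
        super().__init__()
        # Linear projections tokenize visual patches and per-timestep force-torque readings.
        self.rgb_proj = nn.Linear(patch_dim, embed_dim)
        self.ft_proj = nn.Linear(ft_dim, embed_dim)
        # Modality-specific position encodings added to the tokens.
        self.rgb_pos = nn.Parameter(torch.zeros(1, num_rgb_tokens, embed_dim))  # 36 patches x 2 views
        self.ft_pos = nn.Parameter(torch.zeros(1, num_ft_tokens, embed_dim))
        # 8 learned latent vectors cross-attend to the input tokens (PerceiverIO-style).
        self.latents = nn.Parameter(torch.randn(1, num_latents, embed_dim))
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Self-attention layers over the latents.
        self.self_attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True), num_layers)
        # Compress and project the latents to a single output embedding (as in VTT).
        self.out_proj = nn.Linear(num_latents * embed_dim, embed_dim)

    def forward(self, rgb_patches, ft_readings):
        # rgb_patches: (B, 72, patch_dim); ft_readings: (B, 32, 6)
        tokens = torch.cat([self.rgb_proj(rgb_patches) + self.rgb_pos,
                            self.ft_proj(ft_readings) + self.ft_pos], dim=1)
        latents = self.latents.expand(tokens.size(0), -1, -1)
        latents, attn = self.cross_attn(latents, tokens, tokens)  # attn reused for visualization
        latents = self.self_attn(latents)
        return self.out_proj(latents.flatten(1)), attn


class DeltaActionPolicy(nn.Module):
    def __init__(self, embed_dim=128, prop_dim=7, action_dim=3):
        super().__init__()
        self.encoder = VisuotactileEncoder(embed_dim)
        # Proprioception is encoded by an MLP; its embedding is concatenated with the
        # visuotactile embedding before the policy MLP predicts delta actions.
        self.prop_mlp = nn.Sequential(nn.Linear(prop_dim, 64), nn.ReLU(), nn.Linear(64, 64))
        self.policy_mlp = nn.Sequential(nn.Linear(embed_dim + 64, 256), nn.ReLU(),
                                        nn.Linear(256, action_dim))

    def forward(self, rgb_patches, ft_readings, proprio):
        vt_emb, _ = self.encoder(rgb_patches, ft_readings)
        return self.policy_mlp(torch.cat([vt_emb, self.prop_mlp(proprio)], dim=-1))
```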
To gain further insight into the information learned by our model, we visualize the attention weights in the latent-vector cross-attention step of the transformer visuotactile encoder. For each modality, we plot attention weights as proportions of total attention to tokens in that specific modality, averaged over the 8 learned latent vectors. These weights are visualized as heatmaps overlaid on the left and right wristview images for visual attention, and as bars for each timestep under the force reading for tactile attention. We also plot the proportion of total attention for each modality (visual and tactile) over the course of a rollout.
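For reference, the sketch below shows how such per-modality attention proportions could be computed from the cross-attention weights returned by the encoder sketch above; the token ordering (72 visual tokens followed by 32 tactile tokens) and the function name are assumptions for illustration.

```python
# Sketch of the attention-proportion computation behind the visualizations above,
# assuming cross-attention weights of shape (B, 8 latents, 104 tokens) with the
# 72 visual tokens preceding the 32 tactile tokens (as in the encoder sketch).
import torch

def attention_proportions(attn: torch.Tensor, num_visual: int = 72):
    per_token = attn.mean(dim=1)                              # average over the 8 learned latents
    visual, tactile = per_token[:, :num_visual], per_token[:, num_visual:]
    visual_total = visual.sum(dim=-1)                         # proportion of total attention (visual)
    tactile_total = tactile.sum(dim=-1)                       # proportion of total attention (tactile)
    visual_heatmap = visual / visual_total.unsqueeze(-1)      # per-patch weights for the image overlays
    tactile_bars = tactile / tactile_total.unsqueeze(-1)      # per-timestep bars under the force reading
    return visual_total, tactile_total, visual_heatmap, tactile_bars
```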
NOTE: 5 rollouts per video (50 rollouts for experiment results).
Takeaways: Despite our model taking in twice as many visual tokens (72 tokens, 36 per view) as tactile ones (32 tokens), we observe that tactile attention accounts for almost the entire proportion of attention across the input (as seen in the right-most plot of the videos). This finding provides further evidence of the importance of tactile information over visual information as discussed in our paper, where we found that removing visual information from our input had little impact on the robustness of our model. Furthermore, we observe that the visual attention is mostly focused on semantically insignificant parts of the scene, such as the gripper at the bottom of the view, suggesting that the model is not receiving much useful visual information.
To evaluate the validity of the online augmentation method for increasing the robustness of our model, we construct a dataset of human-generated trajectories with an extended set of visual variations and sensor noise, emulating a baseline data augmentation method that applies augmentations independently to each sensory modality offline during training. More specifically, we generate a dataset with training set variations of Scene Appearance (including object color, floor texture, and lighting), Camera Pose, and Sensor Noise with 12 augmentations per demonstration, but rather than keeping the applied variations consistent throughout each augmented rollout, we apply a new instance of the Scene Appearance and Camera Pose variations at each step of the demonstration. We also multiply the force-torque history reading by a random constant (from 0.1 to 2.0) drawn independently at each frame, following a similar data augmentation strategy used in InsertionNet. We denote this dataset as Expanded Visual+Noise.
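As one concrete example, the snippet below sketches the per-frame force-torque scaling in this baseline; the (T, 6) history layout and the helper name are illustrative assumptions.

```python
# Sketch of the InsertionNet-style force-torque scaling used for the Expanded
# Visual+Noise dataset: at every frame, a single random constant in [0.1, 2.0]
# is drawn and applied to that frame's force-torque history window.
import numpy as np

def scale_ft_history(ft_history: np.ndarray, rng: np.random.Generator,
                     low: float = 0.1, high: float = 2.0) -> np.ndarray:
    # ft_history: (T, 6) window of force-torque readings for the current frame.
    scale = rng.uniform(low, high)  # drawn independently at each frame of the demonstration
    return ft_history * scale
```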
Visualization of augmented observations collected for the Expanded Visual+Noise Dataset
Takeaways: We observe that our dataset with an expanded set of augmentations applied independently to each sensory modality does not necessarily improve robustness on most task variations (save for Peg/Hole Shape) compared to the original Visual+Sensor Noise dataset, which used a less extensive set of augmentations. Most crucially, we do not see a significant improvement on Grasp Pose variations, validating the effect of non-independent multisensory data augmentation via trajectory replay. Thus, we have shown that even extensive independent augmentation of our multisensory input may not be enough to handle certain task variations involved in our contact-rich task.
For full transparency in our experiments that report the % success rate change from the Canonical environment, we explicitly report, for each trained model, the success rates in the no-variations Canonical environment on which the % success rate change is based. Success rates are averaged across 6 training seeds, with error bars representing one standard deviation from the mean. It is worth noting that the average % success rate change across the 6 training seeds was calculated by computing the % success rate change for each individual seed and then averaging those values, rather than first averaging success rates across the 6 seeds and then taking the difference of those averages.
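As a concrete illustration of this order of operations (with made-up numbers, and assuming a relative % change with respect to the Canonical success rate):

```python
# Made-up per-seed success rates, for illustration of the reporting convention only.
import numpy as np

canonical = np.array([0.80, 0.70, 0.90, 0.75, 0.85, 0.60])  # Canonical environment, 6 seeds
variation = np.array([0.60, 0.65, 0.45, 0.70, 0.55, 0.50])  # same seeds under a task variation

# Reported: compute the % change per seed, then average over seeds.
per_seed_change = 100.0 * (variation - canonical) / canonical
reported = per_seed_change.mean()

# Not reported: average success rates over seeds first, then take the % change.
alternative = 100.0 * (variation.mean() - canonical.mean()) / canonical.mean()
```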
Left Plot: Success rates on Canonical environment for models with different training set variations. This graph corresponds to the results reported in Figure 5 in our paper.
Center Plot: Success rates on Canonical environment for models with different modality input combinations, trained on No Variations. This graph corresponds to the results reported in the top graph of Figure 7 in our paper.
Right Plot: Success rates on Canonical environment for models with different modality input combinations, trained on Grasp Pose, Peg/Hole Shape, and Object Body Shape. This graph corresponds to the results reported in the bottom graph of Figure 7 in our paper. We especially note the instability of performance in the No Vision model, which provides context for its omission in the corresponding plot in our paper.
Our real-world task setup is built to mirror our simulation setup as closely as possible. We designate one arm to be compliant, applying a constant amount of force, while the other arm moves according to the actions given to it by the policy. In contrast to policies trained in simulation, our real-world policies predict 2-dimensional delta actions in the axes perpendicular to the axis of insertion (rather than 3-dimensional actions that include the axis along the direction of insertion), in order to prevent potentially unsafe interactions that may result from a premature insertion attempt. Once the peg and hole are aligned, the compliant arm automatically moves its held object forward to complete the insertion. We train our real-world models with the same hyperparameters as in simulation, although we train only 1 seed per model (rather than 6). Additionally, we evaluate each model at the end of the entire training process, rather than performing training set rollouts during training to determine the best checkpoint. Successes and failures follow the same general criteria as in simulation, though a human manually annotates each trial.
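For clarity, the snippet below sketches one way the 2-dimensional policy output could be mapped onto an end-effector command with the insertion axis held fixed; the axis convention and names are assumptions for illustration, not our exact controller.

```python
# Sketch of mapping the real-world policy's 2-D delta action onto a 3-D end-effector
# command, zeroing the insertion axis (assumed here to be z). The compliant arm
# performs the final insertion motion once the peg and hole are aligned.
import numpy as np

INSERTION_AXIS = 2  # illustrative assumption: insertion along the z-axis

def to_delta_command(action_2d: np.ndarray) -> np.ndarray:
    delta = np.zeros(3)
    perpendicular = [i for i in range(3) if i != INSERTION_AXIS]
    delta[perpendicular] = action_2d
    return delta
```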
As a real-world analog to our simulation experiment on the difficulty of generalizing to each of our task variations, we train a real-world policy on a dataset of human-generated demonstrations with no applied task variations and evaluate it on real-world versions of a subset of our task variations. Reported success rates over 20 rollouts can be found in the accompanying figure.
Takeaways: As in simulation, we observe that Grasp Pose variations seem to be the most difficult to generalize to, while the model is able to handle the mostly unisensory perturbations of Object Body Shape and Scene Appearance (Object Color and Lighting). We also notice that our model struggles with Grasp Pose even when rotational grasp variations are removed; we hypothesize that this may be because a translational offset disrupts the desired behavior of lining up end-effector positions (given by proprioceptive input) in order to line up the peg and hole (i.e., the task can no longer be solved by simply matching the end-effector positions of the two arms). Including Grasp Pose variations in the training dataset (as was done in simulation through online augmentation) may also improve performance in the real world.
NOTE: 5 rollouts per video (20 rollouts for experiment results).
We conduct a reduced real-world analog to the ablation study in our paper, which investigated how much each sensory modality contributed to overall model performance by training models that accept only a subset of input modalities (vision, touch, and/or proprioception). We train real-world policies on a dataset consisting only of human demonstrations and evaluate them on a smaller subset of our real-world task variations. Reported success rates over 20 rollouts can be found in the accompanying figure.
Takeaways: As in simulation, we observe that the removal of force-torque data as input (the No Touch model) leads to a significant drop in success rate on all task variations compared to the Full Model, including the no-variations Canonical environment. We also see a small drop in performance for the No Vision model, somewhat aligning with our finding in simulation that visual input contributes little to our task. Surprisingly, we see performance increases across all task variations for the No Prop. model. We hypothesize that the small range of possible end-effector poses in our training dataset, due to the high precision required for our task, may cause our models to learn little useful information from the proprioceptive embedding, though this observation may of course also be the result of a low sample size of trained models. Averaging the performance of models trained over multiple seeds (as was done in simulation), which we were unable to do due to time constraints, may yield more robust results.
@article{diaz2024auginsert,
title = {AugInsert: Learning Robust Visual-Force Policies via Data Augmentation for Object Assembly Tasks},
author = {Diaz, Ryan and Imdieke, Adam and Veeriah, Vivek and Desingh, Karthik},
journal = {arXiv preprint arXiv:2410.14968},
year = {2024},
}