Why Feature Concatenation Fails for Robots That See and Feel

by Hongyu Chen and Haonan Chen


Imagine reaching into your backpack to find your keys. Your eyes guide your hand to the opening, but once inside, you rely almost entirely on touch to distinguish your keys from your wallet, phone, and other items. This seamless transition between sensory modalities (knowing when to rely on vision versus touch) is something humans do effortlessly but robots struggle with.

The challenge isn't just about having multiple sensors. Modern robots are equipped with cameras, tactile sensors, depth sensors, and more. The real problem is how to integrate these different sensory streams, especially when some sensors provide sparse but critical information at key moments.

The Modality Sparsity Problem

Current approaches to multimodal robot learning typically use feature concatenation: take the embeddings from all sensor modalities, concatenate them into one large vector, and feed it into a single neural network policy. This seems reasonable: let the network figure out which sensors to pay attention to, right?

[Lee et al. 2017]
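
To make the baseline concrete, here is a minimal PyTorch-style sketch of a concatenation policy. The encoder architectures, feature dimensions, and layer sizes are illustrative assumptions, not the exact networks from any particular paper.

```python
import torch
import torch.nn as nn

class ConcatPolicy(nn.Module):
    """Feature-concatenation baseline: embed each modality, concatenate the
    embeddings into one vector, and feed it to a single policy head.
    All dimensions here are illustrative assumptions."""

    def __init__(self, rgb_dim=256, tactile_dim=64, action_dim=7):
        super().__init__()
        # Per-modality encoders: a small CNN for RGB, an MLP for tactile.
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, rgb_dim), nn.ReLU(),
        )
        self.tactile_encoder = nn.Sequential(
            nn.Linear(128, tactile_dim), nn.ReLU(),  # assumes a 128-d tactile reading
        )
        # One monolithic head over the fused (concatenated) feature vector.
        self.policy_head = nn.Sequential(
            nn.Linear(rgb_dim + tactile_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, rgb, tactile):
        fused = torch.cat(
            [self.rgb_encoder(rgb), self.tactile_encoder(tactile)], dim=-1
        )
        return self.policy_head(fused)
```

Nothing in this architecture distinguishes a modality that is critical for a brief but decisive phase of the task from one that is merely noisy: the head only ever sees the fused vector.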

Unfortunately, this approach has two fundamental flaws.

Problem 1: Sparse Modalities Treated as Noise

Consider a robot trained to retrieve a marker from an opaque bag. For 90% of this task, vision guides the robot as it approaches and positions its gripper at the bag's opening. But once the gripper enters the bag, vision becomes completely useless (the bag is opaque!) and tactile sensing becomes absolutely critical for finding and grasping the marker.

Here's the problem: feature concatenation treats statistically rare signals as noise. During training, the learning algorithm sees tactile information as mostly inactive (it's only crucial for 10% of the task) and downweights it, focusing instead on the always-active visual features. By the time the robot actually needs tactile feedback, the policy has essentially learned to ignore it.

We confirmed this empirically: our RGB+Tactile concatenation baseline achieved only 5% success on this occluded picking task, compared to 35% for RGB-only. Adding tactile information actually made performance worse! The tactile signal was being treated as noise that confused the network.
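
One way to see this effect directly is to probe how strongly a trained policy's output depends on each modality's input, for example by comparing input-gradient norms. The snippet below reuses the illustrative `ConcatPolicy` sketch above; it is a coarse diagnostic, not a formal attribution method.

```python
def modality_sensitivity(policy, rgb, tactile):
    """Compare gradient norms of the predicted action w.r.t. each input.
    A tactile norm near zero suggests the policy has learned to ignore touch."""
    rgb = rgb.detach().clone().requires_grad_(True)
    tactile = tactile.detach().clone().requires_grad_(True)
    action = policy(rgb, tactile)
    action.norm().backward()
    return rgb.grad.norm().item(), tactile.grad.norm().item()

# Example with random stand-in observations (batch of 8):
policy = ConcatPolicy()
rgb_grad, tactile_grad = modality_sensitivity(
    policy, torch.randn(8, 3, 64, 64), torch.randn(8, 128)
)
print(f"RGB sensitivity: {rgb_grad:.3f}, tactile sensitivity: {tactile_grad:.3f}")
```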

Problem 2: Cannot Add or Remove Modalities

The second major limitation of feature concatenation is its lack of modularity. When you want to add a new sensor to an existing robot or remove a faulty sensor, you must retrain the entire policy from scratch. This is because the monolithic network architecture tightly couples all modalities together at the feature level.
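
To make the coupling concrete: the first layer of the fused head is sized to the total concatenated feature dimension, so adding (or removing) any modality changes that shape and invalidates a trained checkpoint. A small sketch of the failure, again using the illustrative `ConcatPolicy` from above:

```python
# Train and save a policy for the original RGB + tactile setup.
old_policy = ConcatPolicy(rgb_dim=256, tactile_dim=64)
torch.save(old_policy.state_dict(), "policy.pt")

# Simulate adding a new sensor by widening the fused input by 64 dimensions.
new_policy = ConcatPolicy(rgb_dim=256, tactile_dim=64 + 64)
new_policy.load_state_dict(torch.load("policy.pt"))
# RuntimeError: size mismatch for policy_head.0.weight (and related layers);
# the practical options are retraining from scratch or manual layer surgery.
```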

This creates severe practical problems: