What is Language Grounding?

Wikipedia offers a rather philosophical definition of symbol grounding:

In cognitive science and semantics, the symbol grounding problem concerns how it is that words (symbols in general) get their meanings, and hence is closely related to the problem of what meaning itself really is.

To illustrate this, reflect on what (sort of) happens in your mind when you see words on a page: these arbitrary symbols evoke "grounded" experiences which lead to your understanding of the text. You might see the word "dog", which effortlessly brings up a short "GIF" of what a dog looks like, how it moves or barks, and your past experiences with dogs (they can hurt you, you can pet them and feel their hair on your fingers while doing so, ...). So much more could come up than I can list here. And it will be more entangled and subconscious than what I described, especially if you read, speak or listen quickly rather than reflecting on a single word as described.


So when I say Language Grounding, I mean this idea that words are connected to the real world and our experiences in it, and that's how they get their meaning.

Since language only makes sense when you're not the only person on earth, the social aspect is also important for the idea of language grounding.

Alternatively, to get an intuition for language grounding, you can ponder how children acquire language: what they see, what their parents show them, what they touch and how they move around; how interactive and social it all is; and how efficiently they learn language to understand the (social) world around them.

Why do I care about Language Grounding?

I studied Computational Linguistics, so I tend to see things through the lens of Natural Language Processing, which I personally consider a subfield of AI these days. So with this in mind, let's make the idea of language grounding a bit more concrete and technical!

Over the last 1-2 years, it has become my big dream to build systems that can speak with us in a truly human-like way. But at the end of my undergrad degree in Computational Linguistics I realized: models trained solely on text will never reach this "human-like language understanding", as they neglect that language gets its meaning when agents interact in a complex multi-modal environment.

However, the majority of models we currently use in NLP still train on nothing but text, and we just hope that scaling data and model size will get us further and further.

These models have no connection to some external "real" world: all they "see" is discrete words (or what we call tokens in NLP). But is it possible to get a realistic idea of the world through that? Is it possible to infer spatial concepts from a lot of raw text (e.g. how a typical apartment is arranged)? Can you infer what movement is and how one usually moves around as a human (e.g. what it means to run down the stairs to catch a train)? What a texture feels like? All of these constantly come up in text in one way or another, but often just implicitly.
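To make "all they see is tokens" a bit more tangible, here is a minimal sketch of what a text-only model receives as input. It assumes naive whitespace tokenization and a tiny made-up corpus; real NLP models typically use subword tokenizers and far larger vocabularies, and the names `build_vocab` and `encode` are just illustrative helpers, not any library's API.

```python
def build_vocab(sentences):
    """Assign each distinct lowercased word an integer ID, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Turn a sentence into the list of integer IDs a text-only model would consume."""
    return [vocab[word] for word in sentence.lower().split()]


# Toy corpus (an assumption for illustration only).
corpus = ["the dog barks", "the dog runs down the stairs"]
vocab = build_vocab(corpus)

ids = encode("the dog runs down the stairs", vocab)
print(ids)  # [0, 1, 3, 4, 0, 5]
```

From the model's point of view, "dog" is just the ID 1. Nothing in these integers says what a dog looks like, how stairs are shaped, or what running feels like; any such knowledge has to be inferred from which IDs tend to occur together.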

In theory, you could encode all of that explicitly in natural language somehow, e.g. in order to precisely describe the spatial nature of a typical apartment. You might start by saying: "So room 1 is a cuboid, with lengths 5, 4 and 3 metres. In the middle of wall 3, there is a door, which has a rectangular shape. It is 2 metres high..." These descriptions would have to go on for pages, and even then they already require some spatial understanding of terms like "high" or "rectangular". But who writes like that in practice?

So in the texts we train on, this spatial knowledge (and a lot of other knowledge) is encoded far more implicitly and scattered over millions of documents.

In short: language is either a very inefficient encoding of many real-world phenomena or it leaves them completely "as an exercise for the reader". Language is too high-level to describe them directly, yet it depends on these lower-level phenomena and needs to have access to them.

One can argue that just enough scaling (see the GPT-3 hype) might still magically give the model enough scattered implicit encodings of real-world phenomena. In the same way, you might learn maths through someone "dancing maths to you" non-stop for many years. At some point you might get it, but it's certainly not the best method to study maths.

So, with grounding we would hopefully need less data, which might mean smaller models and less training time! And I believe that research on language grounding will also lead to a better cognitive and philosophical understanding of language compared to scaling text-only models.

So how can we improve the situation as researchers in NLP?