As a starting point for a research project about how advances in natural language understanding will affect the professional services industry, I’m trying to come up with a formal definition of natural language understanding and a way to measure it. So far, I’m stumped. This post doesn’t have a firm conclusion; instead, I’m going to describe the thought process I’ve gone through, which poses an open-ended and almost philosophical question: what does “understanding” even mean, and does it have different meanings for machines than for humans?
Up until now, I’ve informally defined natural language understanding as “the ability of machines to understand you when you talk to them.” To begin my search for a formal measurement of how well machines can talk with people, I went to the Electronic Frontier Foundation’s AI Progress Measurement Project, which crowdsources the latest objective metrics on machine capabilities from experts.
The only category EFF had that was related to a machine’s speaking and listening ability was “speech recognition.” So at first, I thought that this might be a good measurement of natural language understanding. But then I saw that speech recognition was measured by a program’s ability to accurately transcribe a large database of phone conversation recordings called Switchboard. And I remember transcribing radio interviews for the press team on the Deeds campaign. If I hadn’t paused the recording every 20-30 seconds, I would have made a ton of mistakes. But that didn’t mean I wasn’t understanding every word they were saying. So transcription is a completely different process from understanding.
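To be concrete about what the Switchboard benchmark actually scores: speech recognition results are typically reported as word error rate, the number of word-level edits needed to turn the machine’s transcript into a human reference transcript, divided by the reference length. Here’s a minimal sketch of that calculation (my own illustration, not the benchmark’s actual scoring code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein (edit) distance over words, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# A transcript can be "wrong" even when the listener understood everything:
print(word_error_rate("i would like a coke please",
                      "i would like a coat please"))  # ~0.17
```

Notice that this metric never asks whether the machine grasped what the words meant; it only counts surface mismatches, which is exactly why my transcription experience made me suspicious of it as a proxy for understanding.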
So what happens when a person hears and understands something? When you listen to someone talk, the sounds they are making trigger thoughts and feelings in your mind, provided that you and the speaker share a language, and that your mind holds semantic memories of the concepts those sounds are supposed to represent.
When I thought about understanding that way, it seemed impossible that a machine could be anywhere close to human-level performance at natural language understanding. The process of navigating a vast three-dimensional galaxy of meaning, where every common physical and intellectual phenomenon in the world, and many rare ones, are represented, seems like an aspect of general human intelligence. In other words, it seems to me that the ability to truly understand what a human is saying, no matter what she says, would be impossible to create in a machine absent artificial general intelligence.
Clearly, natural language understanding is a thing that some machines can do at least a little bit, and there must be a way to measure it. But since those machines don’t have the semantic memory that underlies a human’s language comprehension, natural language processing must refer to programs with limited expected inputs and available outputs. That kind of natural language understanding must consist of transcription plus really good analytics that pair whatever a human just said into the microphone with the most correct available response.
And if that analysis is correct, it leads to yet another point of confusion: if natural language processing is about a program’s ability to determine an input’s meaning relative to its possible outputs, isn’t any assessment of that machine’s natural language processing capability entirely dependent on the range of outputs available to it? Wouldn’t it be comparing apples and oranges to measure the natural language processing ability of a really complex program with 100 available outputs against that of a simple one with 2 possible outputs?
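One possible way to soften the apples-and-oranges problem (my own sketch, not an established NLU metric) is to score each program against its chance baseline: a 2-output system that guesses randomly already gets 50% of inputs “right,” while a 100-output system gets only 1%. Adjusting raw accuracy for that puts the two on something closer to a common scale:

```python
def chance_adjusted_accuracy(accuracy: float, num_outputs: int) -> float:
    """Rescale raw accuracy so 0.0 = uniform random guessing and 1.0 = perfect."""
    chance = 1.0 / num_outputs
    return (accuracy - chance) / (1.0 - chance)

# Two systems, each 80% accurate on their own tasks, are not equally impressive:
print(chance_adjusted_accuracy(0.80, 2))    # ~0.6
print(chance_adjusted_accuracy(0.80, 100))  # ~0.8
```

Even so, this only measures classification into a fixed set of outputs, which is exactly the limitation that makes me doubt it deserves the word “understanding.”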
Or is there a standard for measuring language “understanding” by which a machine might truly understand English, even if a given input is irrelevant to the program’s available outputs?
Consider a soda machine that can only dispense soda and say, “Here is your [soda.name], thank you.” And you say, “Soda machine, I’d like a coconut souffle.” Can that machine “understand” what you said, even if it doesn’t know what a coconut souffle is?
Or is natural language understanding just being able to tell that “give me a fucking coke,” “I’d like a Coke please,” and “I shall have the caffeinated brown sugar water in the bright red can” all amount to the same request?
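To make that concrete, here’s a minimal sketch of what the “really good analytics” might look like at their crudest: a toy matcher that maps all three phrasings onto the soda machine’s single available action. The synonym list and function names are mine and purely illustrative; real systems use trained statistical models rather than hand-written rules:

```python
# Toy intent matcher: maps varied phrasings onto the machine's one action.
COKE_SYNONYMS = {"coke", "cola", "caffeinated brown sugar water"}

def understand(utterance: str) -> str:
    text = utterance.lower()
    if any(phrase in text for phrase in COKE_SYNONYMS):
        return "DISPENSE_COKE"
    return "UNRECOGNIZED"  # e.g., the coconut souffle request

for line in ["give me a fucking coke",
             "I'd like a Coke please",
             "I shall have the caffeinated brown sugar water in the bright red can",
             "Soda machine, I'd like a coconut souffle"]:
    print(f"{line!r} -> {understand(line)}")
```

By this measure the machine “understands” all three Coke requests equally well and the coconut souffle not at all, which is exactly the ambiguity I’m stuck on.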
I suspect the latter case hints at an operating definition of natural language understanding. But I’m still confused as to how that can be measured across applications of varying complexity.