The Reliability of Lipreading

Is it remarkable that it is possible for anyone to learn to lip-read?

Posted at — Nov 15, 2017

Lip-reading means discerning words that cannot be heard solely from facial movements. This seems very hard to do, as the only visible aspects of speech are the lips, the teeth, the height of the jaw and, partially, the tongue. The sounds these articulators can produce are formally codified in the International Phonetic Alphabet (IPA), which lists the speech sounds the human vocal tract can produce. The chart assigns one character per sound, and its main sections cover vowels and pulmonic consonants. The consonants are grouped by place and manner of articulation, whereas the vowels are arranged by height and backness.

Firstly, it may not seem possible to lip-read at all, as specific vowels are impossible to discern from physical movements. This is because vowels are created by a constant pulmonic airflow which is shaped by the tongue and the lips. As the tongue is not usually visible, particular vowels are not discernible. It is, however, possible to tell from the lips whether a vowel is rounded or not. This means that it is possible to distinguish cardinal vowels 1-5, which are unrounded, from 6-8, which are rounded. This gives a rough estimate of the sound, mostly by separating front vowels from back vowels. For this reason, the appearance of a vowel is not very helpful in identifying the specific phoneme, only in ruling out what it cannot be.

Secondly, it is also possible to discern the height of a vowel. Vowels such as [i] and [u] are close vowels, whereas [a] and [ɑ] are open vowels. Height refers to how far open the mouth is when producing the sound, which means that close and open vowels are visually distinct. Despite this, the exact height of a vowel varies with the individual’s accent, so this may only be a useful cue if you are familiar with the speaker.
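To make these limits concrete, here is a minimal Python sketch. It assumes that a lip-reader sees only two vowel features, jaw height and lip rounding, and it hard-codes the eight primary cardinal vowels. The `CARDINAL_VOWELS` table and the helper names are illustrative and do not come from any library.

```python
from collections import defaultdict

# Primary cardinal vowels 1-8: symbol -> (height, backness, rounded).
CARDINAL_VOWELS = {
    "i": ("close", "front", False),
    "e": ("close-mid", "front", False),
    "ɛ": ("open-mid", "front", False),
    "a": ("open", "front", False),
    "ɑ": ("open", "back", False),
    "ɔ": ("open-mid", "back", True),
    "o": ("close-mid", "back", True),
    "u": ("close", "back", True),
}

def visible_signature(vowel):
    """Keep only what is assumed to be visible: jaw height and lip rounding."""
    height, _backness, rounded = CARDINAL_VOWELS[vowel]
    return (height, rounded)

def group_by_appearance(vowels):
    """Group vowels that share the same visible signature."""
    groups = defaultdict(list)
    for vowel in vowels:
        groups[visible_signature(vowel)].append(vowel)
    return dict(groups)

groups = group_by_appearance(CARDINAL_VOWELS)
for signature, members in groups.items():
    print(signature, members)
```

Under these assumptions [i] and [u] fall into different groups because of rounding, while [a] and [ɑ] share a group because backness is thrown away. A fuller vowel inventory would collide more often: the close front rounded [y], for instance, would be grouped with [u].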

Consonants in the IPA chart are split into eleven places of articulation, but only three are visible: bilabial, labiodental and dental. This means that only roughly 30% of the places where consonants are produced can be seen. Even so, phonemes are not evenly distributed among the places. For example, the alveolar column covers all eight standard manners of articulation, whereas the dental column has only one of its own. Fortunately, English uses relatively few consonants produced from the palate backwards, perhaps because they require more control to produce; the only common ones are [k], [ɡ], [h], [j] and [ŋ] (other sounds are found in some dialects, such as the [x] in Scottish [lɔx], “loch”). Consequently, English is better suited to lip-reading than languages such as Arabic, which has many sounds produced at the back of the vocal tract.
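The proportion of visible places depends on how the places are counted. As a rough check, the sketch below uses the eleven column headings printed on the 2015 IPA pulmonic consonant chart and treats every place as equally important, which is a simplifying assumption: it ignores how often each place is actually used in speech. The names `PLACES`, `VISIBLE` and `visible_fraction` are illustrative.

```python
# Place-of-articulation columns as printed on the 2015 IPA pulmonic consonant chart.
PLACES = [
    "bilabial", "labiodental", "dental", "alveolar", "postalveolar",
    "retroflex", "palatal", "velar", "uvular", "pharyngeal", "glottal",
]

# Places this essay treats as visible to a lip-reader.
VISIBLE = {"bilabial", "labiodental", "dental"}

def visible_fraction(places, visible):
    """Fraction of the listed places that belong to the visible set."""
    seen = [place for place in places if place in visible]
    return len(seen) / len(places)

fraction = visible_fraction(PLACES, VISIBLE)
print(f"{len(PLACES)} places, {fraction:.0%} visible")
```

Three visible places out of eleven is a little under 30%, which is where the rough figure comes from.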

It is also impossible to see whether a sound is voiced or not. For this reason, [pɒt] and [bɒd] would look the same, even though they have different meanings. This is a particular problem for the fricatives, which come in voiced and voiceless pairs and occupy 11 of the 41 place and manner cells of the IPA chart that contain symbols.

Additionally, it is not possible to see whether a sound is nasal or not. This is because the velum, which lowers to open the nasal cavity, is located far back in the mouth. A consequence of this is that [m] looks the same as [b]. So [meɪbi] and [beɪbi] would look identical, and telling them apart would depend on context.
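The effect of losing voicing and nasality can be sketched in a few lines of Python. The mapping below is a hypothetical simplification, not a standard viseme table: it merges consonants that share a place of articulation and leaves every other symbol, including vowels, unchanged.

```python
# Hypothetical viseme classes: voicing and nasality are assumed to be invisible,
# so consonants made at the same place are merged into one class.
VISEME_CLASS = {
    "p": "P", "b": "P", "m": "P",              # bilabial
    "f": "F", "v": "F",                        # labiodental
    "θ": "TH", "ð": "TH",                      # dental
    "t": "T", "d": "T", "n": "T",              # alveolar plosives and nasal
    "s": "S", "z": "S",                        # alveolar fricatives
    "k": "K", "g": "K", "ɡ": "K", "ŋ": "K",    # velar (ASCII g and IPA ɡ)
}

def to_visemes(transcription):
    """Map each symbol to its viseme class; unlisted symbols pass through."""
    return [VISEME_CLASS.get(symbol, symbol) for symbol in transcription]

def look_alike(first, second):
    """True if two transcriptions collapse to the same viseme sequence."""
    return to_visemes(first) == to_visemes(second)

print(look_alike("pɒt", "bɒd"))
print(look_alike("meɪbi", "beɪbi"))
print(look_alike("pɒt", "fɒt"))
```

With this mapping the first two pairs collapse to the same sequence, while [pɒt] and [fɒt] stay distinct because bilabial and labiodental sounds are both visible.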

Surprisingly, it is not just audio which factors into what we hear. The McGurk effect demonstrates the importance of visuals in processing speech. In the first stage of the demonstration, participants are shown a video of someone saying [bɑ]. The audio is kept constant for the next stage, but the video is switched to someone saying [fɑ]. The participant will then hear [fɑ], despite the audio being the same. This works even when the subject is told that it is the same audio file, which suggests that our hearing is strongly influenced by our sight. It is therefore reasonable to suggest that how some sounds are perceived depends on the visuals that accompany them.

Even though only around 30% of speech production may be visible, there are many things which can boost this figure. For example, it would be a lot easier to estimate what someone is likely to say at work or at home than on a first date. This is due to familiarity with the person’s daily vocabulary and how they produce speech. In addition, working out the grammatical form of a specific word is relatively easy, as the structure of the rest of the sentence provides information about what the bound morphemes should be.

However, English has a large number of homophenes: words which look the same on the lips but have different meanings. They arise because several phonemes can share a single viseme, or visible mouth shape, which happens for one of three reasons. Firstly, the places of articulation could be very similar, or far enough back in the mouth, that they are indistinguishable. Secondly, almost all producible sounds have a complement in the form of a voiced or unvoiced counterpart. For example, [t] and [d] are produced in exactly the same manner and place, but [d] is voiced. Finally, plosive phonemes look identical to nasal phonemes at the same place. For example, [d] and [n] are a pair. Out of the 41 pulmonic IPA consonant blocks with symbols, 24 have a complementary voiced or nasal counterpart. This is used to great effect in the common schoolyard example “elephant shoes”, which is infamous because, when it is not audible, it looks very similar to “I love you”.

	A	B
Phrase	I love you	elephant shoes
Transcription	aɪ lʌv ju	ɛlɪfʌnt ʃuz
Segmented	aɪ lʌ v ju	ɛ lɪ fʌnt ʃuz

The reason this example works is as follows. The first sound of A is a glide from [a] to [ɪ], which passes through the position of the [ɛ] in B. The next syllable in both is a lateral followed by a vowel. Following this is a labiodental fricative, which differs between A and B only in voicing. In B, however, the syllable ends with an alveolar nasal and plosive. This is the only place where the two examples differ, but the difference is negligible as alveolar sounds are only partially visible. Finally, A has a palatal approximant where B has a postalveolar fricative. These two places are millimetres apart, so they are indistinguishable. B also has a final [z], which is produced in nearly the same place and is not significant. Therefore all the sounds that are visible are the same, and the distinguishing factors lie in the non-visible parts. Consequently, it is possible to confuse the two phrases when lip-reading.
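The comparison can be approximated in code. The sketch below is only a rough model built on stated assumptions: the visible cues are limited to lip consonants (bilabial, labiodental, dental) and vowel rounding, every other consonant is treated as invisible, and adjacent identical cues are merged. The symbol sets and the `visible_cues` helper are illustrative; `SequenceMatcher` is from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher

# Assumed visible cues: consonants made at the lips or teeth, plus vowel rounding.
VISIBLE_CONSONANTS = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "θ": "dental", "ð": "dental",
}
UNROUNDED_VOWELS = set("aɪɛʌie")
ROUNDED_VOWELS = set("uoɔ")

def visible_cues(transcription):
    """Reduce a transcription to the sequence of cues assumed to be visible."""
    cues = []
    for symbol in transcription:
        if symbol in VISIBLE_CONSONANTS:
            cue = VISIBLE_CONSONANTS[symbol]
        elif symbol in UNROUNDED_VOWELS:
            cue = "unrounded"
        elif symbol in ROUNDED_VOWELS:
            cue = "rounded"
        else:
            continue  # spaces and consonants made further back are skipped
        if not cues or cues[-1] != cue:  # merge adjacent identical cues
            cues.append(cue)
    return cues

phrase_a = visible_cues("aɪ lʌv ju")
phrase_b = visible_cues("ɛlɪfʌnt ʃuz")
similarity = SequenceMatcher(None, phrase_a, phrase_b).ratio()

print(phrase_a)
print(phrase_b)
print(round(similarity, 2))
```

Under these assumptions the two cue sequences differ only by one extra unrounded vowel in B, the [ʌ] of [fʌnt], which matches the single point of difference described above.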

However, not every visually similar sequence of sounds forms an English word, and many combinations are impossible. For example, [peɪɡɑn] is an English word and would look indistinguishable from [beɪŋɑd], but the latter is not a valid word in English, so it can be ruled out.

Fortunately, similar words can be easier to distinguish because they usually differ in syllable count and stress. By way of illustration, it would usually be difficult to distinguish [əɡɹiːd] from [ɡɹiːd], as [ə] is a short, central vowel which would be easy to miss. However, [əɡɹiːd] has two syllables, with the stress on the second, so its rhythm and duration differ from those of the single syllable [ɡɹiːd]. Even if the intended word would be obvious in context, this makes it easier to tell the two apart.

As consonants are more discernible than vowels, it should be easier to lip-read languages which feature more consonants. Maddieson (2013) groups languages into three main classes: those with only CV and V syllables; those which allow up to three consonants around a vowel; and those which allow many more consonants. English falls into the last category. Maddieson illustrates this with many examples, including the word [stɹɛŋkθs], which has seven consonants and only one vowel. This means that English has more words that are visually distinct, e.g. [ʌmbɹɛlə], than languages which only allow certain combinations. Hence, English is especially suited to lip-reading, as it allows a high ratio of consonants to vowels.
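The consonant and vowel counts for the two example words can be checked with a short Python sketch. The `VOWELS` set is an assumption covering only the vowel symbols used in this essay, not a full IPA inventory, and each symbol is counted separately, so a diphthong such as [aɪ] would count as two vowels.

```python
# Assumed vowel symbols for the examples in this essay; not a full IPA inventory.
VOWELS = set("aeiouɑɒɔəɛɪʊʌ")
IGNORED = set("ː ")  # length marks and spaces are not counted as segments

def consonant_vowel_counts(transcription):
    """Return (consonants, vowels) for a transcription, one count per symbol."""
    consonants = vowels = 0
    for symbol in transcription:
        if symbol in IGNORED:
            continue
        if symbol in VOWELS:
            vowels += 1
        else:
            consonants += 1
    return consonants, vowels

for word in ("stɹɛŋkθs", "ʌmbɹɛlə"):
    consonants, vowels = consonant_vowel_counts(word)
    print(word, consonants, vowels)
```

This gives seven consonants and one vowel for [stɹɛŋkθs], and four consonants and three vowels for [ʌmbɹɛlə].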

In conclusion, lip-reading is hard to do for single words but becomes much easier with context and length. It may be hard to distinguish words with similar mouth shapes, but it is easier when surrounding information shows what could logically follow. If the task is to lip-read someone that you are not familiar with, or if there are obstructions, it will be almost impossible to do so with 100% accuracy. English does make this easier with its large number of frontal consonants and its high ratio of consonants to vowels. Even so, it is only with a great deal of context that it is possible to parse language through lip-reading.

References

McGurk, Harry, and John MacDonald. “Hearing Lips and Seeing Voices.” Nature, vol. 264, no. 5588, 1976, pp. 746–748, https://doi.org/10.1038/264746a0.
Ian Maddieson. 2013. Syllable Structure. In: Dryer, Matthew S. & Haspelmath, Martin (eds.) The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at https://wals.info/chapter/12, Accessed on 2017-11-13.)
International Phonetic Association - Kiel, revised to 2015 https://www.internationalphoneticassociation.org/content/full-ipa-chart.
Warren Maguire. 2017. Phonetics Lecture Slides Linguistics and English Language 1A (2017-2018) The University of Edinburgh
“Is Seeing Believing” Horizon. BBC. 18 Oct. 2010. Television.
“The International Phonetic Alphabet (Revised to 2005).” Seeing Speech: IPA Charts, https://www.seeingspeech.ac.uk/ipachart/display.php?chart=4&datatype=1&speaker=1