Pointing: The Problem with Hebrew TTS
When we started to work on putting together a touchscreen keyboard for Bow to use, it was understood that the English version would come first. Why? Is it because English is an international language that many people across the globe can speak? No. Is it because most of our volunteers at Project Bow are English speakers, but very few are Hebrew speakers? No. Is it because Bow prefers English? No, actually Bow prefers Hebrew.
The reason we all agreed to get it done in English first was that doing TTS for Hebrew is a bigger technical problem.
Why is it a bigger technical problem? Because of the vowels.
Are Hebrew vowels harder to pronounce? No. Hebrew vowels are much simpler than English ones. It's just that when we write, we don't normally specify what the vowels are in a word. And yet the vowels are what determine what the word is.
Sound confusing? I'll explain.
In Hebrew, we normally specify only the consonants, with a few exceptions. It's still easy for us to read, because when you take in a whole sentence, it's almost completely unambiguous what was meant to be said.
It may be hard for an English speaker to imagine this, but actually English isn't written phonetically, so when we read English we first identify the word, then we decide how to pronounce it. Let me give you an example. In the following two sentences, how do you know how to pronounce the vowel in the word "read"?
- I read the book yesterday.
- I love to read.
First we read the whole sentence. Then we decide whether it's a past tense or some other form of the verb. Then we know whether the vowel is long or short. That's also how we decide what the vowels are in Hebrew words that are written without pointing. We read the sentence, then we know what the word is. Once we identify the word, we know how to pronounce it. Native speakers do this naturally in the blink of an eye.
In the case of Bow's writing, there is an added complication. Bow doesn't have a spacebar, so he doesn't specify where one word ends and another begins. For the computer program that is to turn Bow's text into speech, this creates the requirement that before identifying the word, we must break up the text stream into separate words. To give an example in English, it would be something like this:
Iwantanapple
As a speaker of English, you don't really have much trouble breaking the sentence into words. It's pretty unambiguous even without the spaces. But how would you translate what you naturally do into an algorithm that even a computer could use?
In this example, it's pretty easy. The computer has a corpus of words in English. That means a list of words. It can compare the words in the list with sequences in the string of letters above. If it finds a word, it can then go on to parse the remaining string into words. Let's say it's satisfied with the shortest possible word it can find. In a first pass, it would parse the string of letters like this:
I want a
and it would have "napple" left over. (This is assuming the corpus doesn't have the word "wan", just for purposes of simplification.) Now when it checks the corpus, the computer won't find "napple", so it's going to have to try for a second pass. It will look for a longer third word and will find "an". Now the sentence will resolve itself into the following words:
I want an apple
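The shortest-word-first procedure described above can be sketched roughly as follows. This is only an illustration, not the program we ended up with: the function name `segment_shortest_first` and the five-word toy corpus are assumptions made up for the example, and the "second pass" is modeled as simple backtracking to the next longer word.

```python
def segment_shortest_first(text, corpus):
    """Split text into corpus words, trying the shortest candidate first
    and backing up to a longer one when the remainder cannot be parsed."""
    if not text:
        return []  # nothing left over: the parse succeeded
    # Try prefixes in order of increasing length (shortest word first).
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in corpus:
            rest = segment_shortest_first(text[end:], corpus)
            if rest is not None:
                return [prefix] + rest
    return None  # no prefix leads to a complete parse


# Toy corpus (an assumption for illustration); it deliberately omits "wan".
corpus = {"i", "want", "a", "an", "apple"}
print(segment_shortest_first("iwantanapple", corpus))
# prints ['i', 'want', 'an', 'apple']
```

The sketch first tries "a" as the third word, fails on "napple", and only then falls back to "an", which mirrors the two passes described above.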
If the corpus had had the word "wan" in it, this algorithm of looking for the shortest possible word might not have worked. We might have ended with the following set of words:
I wan tan apple
A speaker of English knows that "I wan tan apple" does not make a sentence, but a computer without access to syntactic and semantic knowledge might not know that. However, since both "I want an apple" and "I wan tan apple" sound almost the same, it's likely that an English speaker who heard the computer pronounce "I wan tan apple" would think that he heard "I want an apple", and hence there would not be any practical problem with this superficial form of parsing for purposes of TTS.
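To see how the ambiguity arises, one can list every possible split instead of stopping at the first one. The sketch below is a hypothetical illustration: the function name `all_segmentations` and the toy corpus (now including "wan" and "tan") are assumptions for the example.

```python
def all_segmentations(text, corpus):
    """Return every way to split text into words from corpus."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in corpus:
            for rest in all_segmentations(text[end:], corpus):
                results.append([prefix] + rest)
    return results


# Toy corpus (an assumption for illustration), this time with "wan" and "tan".
corpus = {"i", "want", "wan", "tan", "a", "an", "apple"}
for words in all_segmentations("iwantanapple", corpus):
    print(" ".join(words))
# prints:
# i wan tan apple
# i want an apple
```

Because shorter prefixes are tried first, "i wan tan apple" comes out ahead of "i want an apple". Nothing in the word list alone says which one is the real sentence.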
In the case of Hebrew, where most vowels are unspecified, we might expect more occasions for misunderstanding.
Last week, I was at a linguistics conference where I presented a paper on standards of proof in ape language studies. The paper was well received, and I came home thinking that this week I would work on a flow chart of the TTS problem in Hebrew. The first step would be to decide how to divide strings of letters into words.
It turned out that if we limit ourselves to a corpus of Bow's Hebrew vocabulary, rather than all Hebrew, the resolution of "I want an apple" in Hebrew from letters into words is just as unproblematic as it is for the English sentence. So I was going to put together a flow chart of the same algorithm I described above with the English example: take the smallest sequence of letters that spells a word in the corpus, put it aside, then apply the same to the remaining letters, until you end up with all the letters divided into words. If it doesn't resolve itself on the first pass, try as many passes using longer words as are necessary to get every letter in sequence into a word.
The above is a sloppy description of what I mean, and I needed a more accurate way to describe the process. The first thing I did was to look for free software that would allow me to put together a flow chart.
I downloaded a version of Smartdraw that would allow me to use it free of charge for seven days. As I was struggling with this software, Bow started blowing raspberries. I was discussing the algorithm with myself as I tried to put together the flowchart, and Bow became increasingly upset. So I went to see what the problem was. "What do you want?" I asked impatiently.
He took my hand and spelled out the following:
כלאדםמתנסהבבעיותתקשורת
Although I was not aware that Bow provided directions for where the spaces would go, I immediately knew that this was what he had said:
כל אדם מתנסה בבעיות תקשורת
"Every person experiences communication problems."
This was an odd thing for him to say. It sounded like a fortune cookie generalization. "Why did you say that?" I asked.
"Because Bow is smart," he replied in Hebrew.
I went back to my flow chart. I was having a lot of problems with it. Bow went back to blowing raspberries. When I asked him what the problem was, I got the same reply: "Every person experiences communication problems."
"Why do you keep saying that?"
"Because Bow is smart."
I tried to go back to my work, but this was starting to bug me. Was he trying to tell me something? It was an odd sentence. It was much too formal and general. Had I misinterpreted it? Was he trying to say something else?
Suddenly, it occurred to me that I should try my algorithm on this sentence. Could it resolve into a completely different sentence? Trying for the longest possible list of words, I found that the string of letters could be resolved as follows into a list of words:
כלא דם מת נס הב בעיות תקשורת
"Prison (or imprisoned), blood, dead, miracle, give, communication problems."
This wasn't strictly by the algorithm I had written, but it was definitely a possible way to parse the sequence. However, the way I originally parsed it, with a two-letter first word ("every"), is the way my algorithm would have processed it. It would also have cut the third word off at two letters, spelling out "dead":
כל אדם מת נסה בבעיות תקשורת
"Every dead person tried (his hand at) communication problems."
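The competing readings of Bow's sentence can be reproduced with a small enumeration. This is a sketch under stated assumptions: the function name `all_segmentations` is made up for the example, the corpus is a hypothetical word list containing only the words mentioned in this post (not Bow's actual vocabulary), and the final letter forms are kept as they appear in the spaced sentences.

```python
def all_segmentations(text, corpus):
    """Return every way to split text into words from corpus."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in corpus:
            for rest in all_segmentations(text[end:], corpus):
                results.append([prefix] + rest)
    return results


# Hypothetical corpus: just the words that appear in the parses discussed here.
corpus = {
    "כל", "כלא", "אדם", "דם", "מת", "מתנסה",
    "נס", "נסה", "הב", "בבעיות", "בעיות", "תקשורת",
}

# Build the unspaced string from the intended sentence to avoid typing errors.
text = "".join(["כל", "אדם", "מתנסה", "בבעיות", "תקשורת"])

for words in all_segmentations(text, corpus):
    print(" ".join(words))
```

With this toy corpus the string splits six ways, and all three readings quoted above are among them. A word list by itself cannot choose between them.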
That doesn't make much sense, but it is a grammatical sentence. How did I know Bow didn't mean that, instead?
Well, I knew because....
Bow kept blowing raspberries. "What is it?"
Again, with that same sequence. "Every person experiences communication problems."
"Why are you saying this?"
"Because Bow is smart."
"Bow, is this some kind of puzzle? You know this sentence doesn't make another sensible sentence... You couldn't possibly have been talking about dead people."
He smiled at me, took my hand, and spelled out the following:
כלאדםמתנסהלקבלאוכל
This time it was clear to me, in an instant, that the words divided like this:
כל אדם מת נסה לקבל אוכל
"Every dead person tried to get food." It had to be divided that way for grammatical reasons, otherwise the sentence would have had too many verbs in a row. So it wasn't strictly speaking the semantics that determined the word division.
"Hmm." I looked at Bow. He was looking at me with a smile on his face, waiting for this to sink in. "So what you're saying is that it could be divided either way, and I know which way you mean, but not because of the words that I recognize..."
He took my hand and spelled out: "Every person experiences communication difficulties."
"Bow, why do you keep saying that!"
"Because I heard what Mommy was trying to do."
"So you don't think the algorithm in my flow chart will work?"
For a while I was really floored by all this. Then I realized that none of it mattered. Why? Because actually I had been reading the words out loud before the sentence was completed. Bow had been cluing me in all along to where the words had ended by the slight pause that he made after each word!
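If the touchscreen keyboard recorded when each letter was pressed, the same pause cue could in principle be used by a program. The sketch below is purely hypothetical: the function name `split_on_pauses`, the timestamps, and the 0.8 second threshold are all invented for illustration, and nothing here has been measured from Bow's actual spelling.

```python
def split_on_pauses(keystrokes, pause_threshold=0.8):
    """Group (letter, time_in_seconds) pairs into words, starting a new
    word whenever the gap since the previous letter exceeds the threshold."""
    words = []
    current = ""
    previous_time = None
    for letter, time in keystrokes:
        if previous_time is not None and time - previous_time > pause_threshold:
            words.append(current)  # a long pause closes the current word
            current = ""
        current += letter
        previous_time = time
    if current:
        words.append(current)
    return words


# Invented timestamps: about 0.3 s between letters, about 1.2 s between words.
keystrokes = [
    ("i", 0.0),
    ("w", 1.2), ("a", 1.5), ("n", 1.8), ("t", 2.1),
    ("a", 3.3), ("n", 3.6),
    ("a", 4.8), ("p", 5.1), ("p", 5.4), ("l", 5.7), ("e", 6.0),
]
print(split_on_pauses(keystrokes))
# prints ['i', 'want', 'an', 'apple']
```

A real threshold would have to be tuned to the speller's rhythm, and a dictionary check like the one described earlier could still serve as a fallback when the pauses are unclear.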
Which just goes to show that while Bow is indeed very smart, he is not really the best person to take advice from when trying to come up with an algorithm for Hebrew TTS!
(c) 2009 Aya Katz
Qaryan: a free open source application for Hebrew TTS
- Qaryan Hebrew TTS at SourceForge.net