Speech Recognition - Ready for Prime Time?
by Jarred Walton on April 21, 2006 9:00 AM EST
Closing Thoughts
At first glance, both of these speech recognition packages appear pretty reasonable. Dragon is more accurate in transcribe mode, but it also requires more processing time. Both also manage to offer better than 90% accuracy, but as stated earlier that really isn't that great. Having used speech-recognition for several months now, I would say that 95% accuracy is the bare minimum you want to achieve, and more is better. If you already have Microsoft Office 2003, the performance offered might be enough to keep you happy. I can't say that I would be happy with it, unfortunately.
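To put those percentages in concrete terms, here is a minimal Python sketch. It is not how the benchmarks in this article were scored (those errors were counted by hand); the function names are hypothetical, and the matching rule (in-order word matches found by `difflib.SequenceMatcher`) is an assumption that only approximates a formal word error rate.

```python
from difflib import SequenceMatcher


def word_accuracy(reference: str, transcript: str) -> float:
    """Fraction of reference words that appear, in order, in the transcript.

    Assumption: a simple in-order word match, not a formal word error rate.
    """
    ref_words = reference.lower().split()
    hyp_words = transcript.lower().split()
    matcher = SequenceMatcher(None, ref_words, hyp_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words) if ref_words else 1.0


def expected_errors(word_count: int, accuracy: float) -> int:
    """Approximate number of misrecognized words in a document of a given length."""
    return round(word_count * (1.0 - accuracy))


if __name__ == "__main__":
    # One wrong word out of four gives 75% accuracy.
    print(word_accuracy("the quick brown fox", "the quick brown box"))
    # A hypothetical 2,000-word article at 90% and at 95% accuracy.
    print(expected_errors(2000, 0.90), expected_errors(2000, 0.95))
```

For a hypothetical 2,000-word article, 90% accuracy works out to roughly 200 words to fix, and 95% to roughly 100, which is why the last few percentage points matter so much in daily use.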
It may simply be that I started with Dragon NaturallySpeaking, but so far every time I've tried to use Microsoft's speech tool, I've been frustrated with the interface. Microsoft does appear to do better when you start speaking rapidly, but I generally only speak in a clause or at most a sentence at a time, and the Microsoft speech engine doesn't seem to do as well with that style of delivery.
Going back to my earlier analogy of the mouse wheel, I can't help but feel that it's something of the same thing. Having experienced the way Dragon does things, the current Microsoft interface is a poor substitute. It almost feels as though the utility reached the point where it was "good enough" and has failed to progress from there. The accuracy is fine, but dealing with the errors and training the tool to properly recognize words for future use is unintuitive at best - I didn't even have to crack the manual for DNS until I wanted to get into some specialized commands! That said, Windows Vista and the new Office Live are rumored to have better speech support, so I will definitely look into those in the future.
In case you were wondering, the vast majority of this article was written using Dragon NaturallySpeaking. There are still many errors being made by the software, but there are far more errors being made by the user. Basically, the software works best if you can think and speak in complete phrases/clauses, as that gives the software a better chance to recognize words from context. Any stuttering, slight pauses, slurring, etc. can dramatically impact the accuracy. It's also important to enunciate your words -- it will certainly help improve the accuracy if you can do so.
I find DNS works very well for my purposes, but I'm sure there are people out there that will find it less than ideal. There are also various language packs available, including several dialects of English, but I can't say how well any of them work from personal experience. I don't really feel that I can say a lot about getting the most out of speech recognition from Microsoft, so the remainder of my comments come from my use of Dragon NaturallySpeaking.
Thoughts on Dragon NaturallySpeaking
Note: I've added some additional commentary on the subject based on email conversations.
One of the questions that many people have is what's the best microphone setup to use? I use a Plantronics headset that I picked up online for about $30, and it hasn't given me any cause for concern. The soft padded earphones are definitely a plus if you're going to use your headset for long periods of time. I've also used a Logitech headset, and it worked fine, but it's not as comfortable for extended use. The location of the microphone -- somewhat closer to your ear than to your mouth -- also seems to make it less appropriate for noisy environments. There are nicer microphones out there, including models that feature active noise cancellation (ANC), and they could conceivably help -- especially in noisy environments. So far, I at least have not found them to be necessary for my needs.
As far as sound hardware goes, I've used integrated Realtek ALC655 audio, integrated Creative Live! 24-bit, and a discrete Creative Audigy 2 ZS card. The ALC655 definitely has some static and popping noises, but it didn't seem to affect speech recognition in any way that I could see. The quality of integrated audio varies by motherboard, of course, but I would suggest you try out whatever you currently have first. Whatever sound card/chipset you're using ought to be sufficient; if it's not, you could try a USB sound pod or upgrade to a nicer sound card that has a better signal-to-noise ratio. (My Audigy 2 ZS works great - that's what was used for recording the sample audio files.)
There's also a recommendation to make sure you're in a quiet environment in order to get best results. I'm not exactly sure what qualifies as quiet, but my living room with an ambient noise level of 50 to 60 dB doesn't appear to present any difficulties. On the other hand, I did try using speech recognition in a data center that had an ambient noise level of over 75 dB. I received a warning that the noise level was too high during the microphone configuration process, but I did have the option to continue. I did so, but within minutes I had to admit defeat. I could either shout at my microphone, or else I could try to get by with less than 50% accuracy rates.
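The gap between those two rooms is larger than the raw numbers suggest, because decibels are logarithmic. The sketch below is only an illustration: it assumes the quoted figures are ordinary sound pressure levels, and it takes 55 dB as the midpoint of the living room range (my assumption, not a measurement from the article). The function names are hypothetical.

```python
def pressure_ratio(db_difference: float) -> float:
    """Ratio of sound pressures for a given level difference in dB."""
    return 10 ** (db_difference / 20)


def intensity_ratio(db_difference: float) -> float:
    """Ratio of sound intensities (power) for a given level difference in dB."""
    return 10 ** (db_difference / 10)


if __name__ == "__main__":
    living_room_db = 55.0  # assumed midpoint of the 50 to 60 dB range
    data_center_db = 75.0  # the article says "over 75 dB", so this is a lower bound
    gap = data_center_db - living_room_db
    print(f"{gap:.0f} dB gap: {pressure_ratio(gap):.0f}x pressure, "
          f"{intensity_ratio(gap):.0f}x intensity")
```

Under those assumptions, the data center is at least 20 dB louder, which corresponds to about ten times the sound pressure and a hundred times the sound intensity reaching the microphone.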
I would say there's a reasonable chance that active noise cancellation microphones could handle such an environment, but I didn't have a need to actually make that investment. In fact, several SRS experts have emailed me and recommended ANC for exactly that sort of situation. It should allow you to use speech recognition even in noisy environments, and it can also help improve accuracy even in quieter environments. If you're serious about using such software, it's worth a look. However, few locations are as noisy as a data center -- trade shows certainly are close -- so most people won't have to worry about excessive noise. Most office environments should be fine, provided you don't mind people overhearing you.
The one area that continues to plague DNS in terms of recognition errors is acronyms. Some do fine, but there are many that continue to be recognized incorrectly, even after multiple attempts to train the software. For example, "Athlon 64 ex 240 800+" is what I get every time I want "Athlon 64 X2 4800+" -- that's when I say "Athlon 64 X2 forty-eight-hundred plus-sign". I can normally get the proper text if I say "Athlon 64 X2 four-thousand eight-hundred plus-sign", but I still frequently get "ask to" or "ax to" instead of "X2". My recent laptop review also generated a lot of "and/end 1710" instead of "M1710", despite my best attempts to train the software. There's a solution for this: creating custom "macros" for acronyms you use a lot. The only difficulty is that you have to spend the time initially to set up the macros, but for long-term use it's definitely recommended.
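Dragon's own macro facility is not shown here. As a rough illustration of the same idea applied after the fact, the following Python sketch cleans up a finished transcript with a table of known misrecognitions. The table contents, the `CORRECTIONS` name, and the `fix_transcript` function are all hypothetical; the entries are taken from the examples above, with "Athlon 64" kept as context so that an ordinary "ask to" elsewhere in the text is left alone.

```python
import re

# Hypothetical table: misrecognized phrase (lowercase) -> intended text.
CORRECTIONS = {
    "athlon 64 ex 240 800+": "Athlon 64 X2 4800+",
    "athlon 64 ask to": "Athlon 64 X2",
    "athlon 64 ax to": "Athlon 64 X2",
    "and 1710": "M1710",
    "end 1710": "M1710",
}


def fix_transcript(text: str, corrections: dict[str, str] = CORRECTIONS) -> str:
    """Replace known misrecognitions, trying the longest phrases first."""
    keys = sorted(corrections, key=len, reverse=True)
    # Lookarounds instead of \b so that phrases ending in "+" still match.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)",
        re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in corrections.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


if __name__ == "__main__":
    print(fix_transcript("The Athlon 64 ex 240 800+ beat the end 1710 laptop."))
```

A table like this shares the drawback mentioned above: it takes time to build, and an entry such as "and 1710" could in principle match text you actually meant, so it is only worth it for terms you dictate often.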
My final comment for now is that the functionality provided by Dragon NaturallySpeaking is far better in Word or DragonPad (the integrated rich text editor that comes with DNS) than in most other applications. If you're using Word or DragonPad, it's relatively simple to go back and correct errors without touching the keyboard. All you have to do is say "select XXX" where XXX is a word or phrase that occurs in the document. DNS will generally select the nearest occurrence of that phrase, and you're presented with a list of choices for correcting it, or you can just speak the new text you want to replace the original text. This is one of those intuitive things I was talking about that Microsoft currently lacks.
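How DNS actually decides which occurrence is "nearest" is not documented here, so the following is only a guess at the behavior, sketched in Python. It assumes nearness means character distance from the current cursor position; the `nearest_occurrence` function is hypothetical.

```python
from typing import Optional


def nearest_occurrence(document: str, phrase: str, cursor: int) -> Optional[int]:
    """Return the start index of the occurrence of phrase closest to cursor.

    Assumption: "nearest" means smallest character distance from the cursor.
    Returns None when the phrase does not occur in the document.
    """
    starts = []
    pos = document.find(phrase)
    while pos != -1:
        starts.append(pos)
        pos = document.find(phrase, pos + 1)
    if not starts:
        return None
    return min(starts, key=lambda start: abs(start - cursor))


if __name__ == "__main__":
    text = "the cat saw the dog near the end"
    # With the cursor at the end of the text, the last "the" is selected.
    print(nearest_occurrence(text, "the", len(text)))
```

This also hints at why the feature depends on the application: a search like this only works if the software still has the full document text and cursor position, which is exactly the information that can be lost outside of Word and DragonPad.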
There are a few problems with this system. The biggest is that outside of Word and DragonPad, the instant you touch the keyboard or switch to another application, DNS loses all knowledge about any of the previous text. This happens a lot with web browsers and instant messaging clients -- I surf almost exclusively with Firefox, so I can't say whether or not this holds true for Internet Explorer. Another problem is that sometimes the selection gets off by one character, so you end up deleting the last character of the previous word and getting an extra space on the other side. (This only happens outside of Word/DragonPad, as far as I can tell.) I've also had a few occasions where the system goes into "slow motion" when I try to make a correction: the text will start to be selected one character at a time, at a rate of about two characters per second, and then once all the text is selected it has to be unselected again one character at a time. If I'm trying to select a larger phrase, I'm best off just walking away from the computer for a few minutes. (Screaming and yelling at my system doesn't help, unfortunately.) Thankfully, both of those glitches only occur rarely.
Hopefully, some of you will have found this article to be an interesting look at a technology that has continued to improve through the years. It's still not perfect, but speech recognition software has become a regular part of my daily routine. There are certainly people out there who type more than I do, and I would definitely recommend that many of them take a look at speech recognition software. If you happen to be experiencing some RSI/carpal tunnel issues caused by typing, that recommendation increases a hundredfold. I'm certainly no doctor, but the expression "No pain, no gain" isn't always true; some types of pain are your body's way of telling you to knock it off.
If you have any further questions about my experience with speech recognition, send them my way. I don't think I'm a bona fide expert on the subject, but I'll be happy to offer some help if I can.
38 Comments
JarredWalton - Sunday, April 23, 2006 - link
Isn't there some comedy routine by an older gentleman that does the whole "verbalize punctuation" shtick? One of the things I might look at in the follow-up article is showing how Dragon does when turning on automatic punctuation. It will attempt to insert periods, commas, and question marks (at least, I think it does question marks) depending on how you speak the text. Obviously, that means you have to be a lot more careful when reading/dictating.
I found it more useful to manually dictate my punctuation, since on frequent occasions I will pause midsentence to try and think what I want to say -- or because of some interruption. Basically, as a writer, punctuation is something that I take pretty seriously. DNS does pretty well with getting it right, but it also makes plenty of mistakes.
Admiral Ackbar - Monday, April 24, 2006 - link
Victor Borge. It's called phonetic punctuation. It was one of the funniest things I have ever seen (I had the privilege of seeing him not long before he died).
Actually though, it could work, and it's quicker than actually saying the word period or question mark.
JarredWalton - Tuesday, April 25, 2006 - link
I bet it takes a hell of a lot of practice, too! Especially if you want to speak at a reasonable clip. I remember laughing my butt off at Victor Borge's routine quite a few years ago. On the bright side, more people might learn how to use proper punctuation!
You also have to worry about the speech recognition software starting to recognize random noises (like a cough) as actual dictation. That happens already, but usually Dragon is smart enough to realize that my cough was merely a loud noise. Sometimes I get the random "the" from it, though.
Tujan - Saturday, April 22, 2006 - link
I would be interested in knowing exactly what the program does. Something more geared towards its features, interaction, etc., rather than somewhat of a comparison between two programs - somewhat of a benchmark.
For example, you mention command mode, but don't get any further into what that encompasses. That alone has its limitations, I'm sure. Yet I'm also sure that many might want to know exactly what it is about. For example: Start - My Documents - FolderName - Open... and so on. Is this how it works? Or something like the HTPC scenario in which you query your favorite TV show: "Channel - channel name", or "program name - file name - open", for the HTPC.
Everybody should know what a vacuum cleaner can do for you. Ya know. But what can you do for your vacuum cleaner?
I imagine (note: imagine, yes) that, given speech recognition that was well enough along, you could utilize a command line interface, and programmers would be able to program more quickly and easily. Other than having your vacuum cleaner attack you, ya know, you could do something like "Dir" - listing of directories. Or "MD" - make directories.
I don't know any programming code, so I can't give examples of anything other than the DOS command line. Still, you can see what I'm getting at. Program your HTML, for example.
But within the Windows environment, you could ask how well the program takes commands and multitasks, since you could use the wave file to do this, and so on.
I'm just curious. I don't see a lot of interesting software reviews dealing with the nuts and bolts of the application itself lately.
Try a RAM drive with that - take the chains off, maybe?
Ardemus - Friday, April 21, 2006 - link
1) How was the software trained? Were you using "normal" or "dictation" speech patterns?
2) Dragon may do much better with a WAV file than with a real-time system because it can read ahead and analyze the whole file.
3) Does Dragon give up resources when other applications ask for them?
4) What sort of errors were made? How many errors are there after a spelling and grammar check in MS Word?
5) Can you correct the errors in each program and scan again, to measure the improvement?
6) I've heard that you can overstress and damage your vocal cords through speech recognition (RSI of the voice). Have you researched that?
7) How often did both packages make the same mistakes? If you ran it through both packages in real-time minimal mode, then DNS at several different speeds, could you run an algorithm on the different results to increase accuracy?
Nick Burger
JarredWalton - Friday, April 21, 2006 - link
1 -- Both were trained in the same manner, basically me speaking the text, but doing my best to enunciate words a little better than I might do in the real world. Besides, good fiction is a useful skill to have, particularly if you're speaking with business people.
2 -- That's entirely possible. One of the odd things is that the accuracy shown in my dictation benchmarks doesn't seem to correspond with my own personal experience of trying to use the software. It may simply be the way that I speak when trying to write articles, but I find that Microsoft is far worse in normal use. That's not a very scientific method, but I can't emphasize enough how much more difficult I find Microsoft's speech interface to use.
3 -- Dragon runs as a normal priority process, and when you're dictating with the accuracy set to "medium" it uses 20 to 50% of the processor time (on a single core Athlon 64 2.4 GHz). The memory footprint is pretty large, at about 150 to 200 MB. As far as I can tell, it will not use more than 200 MB -- during testing, I watched RAM usage on the "maximum accuracy" configuration, because I was curious to see if the switch from 1 GB on my old system to 2 GB on my new system would help. It did not. (The total size of my database/voice files is currently just over 300 MB.)
I also noticed on my old system that Dragon requires a fair amount of hard disk access. I was copying several gigabytes of data from one computer to another computer (over gigabit ethernet) and Dragon's responsiveness dropped way off. It was still accurate, but rather than speaking and seeing the text a second or so later, there was a four or five second pause for most sentences.
4 -- I included a link to a zip file in the article for anyone interested in looking at specific errors. The text files were compared using WinDiff, and I manually counted errors. (I was somewhat lenient, in that I allowed "speech-recognition" to match "speech recognition" -- stuff like that.)
5 -- Dragon has definitely been "trained" on the document. Microsoft seems to do its own thing in terms of training, so all I could do is make sure that all of the words used were known by the speech engine. When you make an error using Microsoft's tool, as far as I know you have to correct with the keyboard. You can't just tell it to select the misinterpreted words and provide the correct interpretation. Perhaps it's possible to switch to command mode, tell the application to select something, then switch to dictation mode and give the correct spelling... at that point, you're far better off using the mouse and keyboard, and if you can't use those then you're much better off using Dragon's interface.
6 -- That's entirely possible, and laryngitis certainly doesn't help speech recognition at all. You definitely don't want to get in the habit of speaking really loudly, so it's best to train the software in a somewhat subdued voice (in my opinion). I would say the most important thing is to do everything in moderation; sitting at a computer dictating for 12 hours a day is going to be just as harmful in the long run as sitting at a computer typing 12 hours a day.
bobsmith1492 - Saturday, April 22, 2006 - link
"Besides, good fiction is a good skill to have when... "
:P Kind of like Isaac Asimov?
JarredWalton - Saturday, April 22, 2006 - link
See what I get for not proofing carefully? LOL - that's the type of error I get most of the time. "A" for "the" is another common one.
Gioron - Friday, April 21, 2006 - link
My brother swears by DNS, but using it myself and watching him use it I just can't stand going that slow. I've gotten to the point where I can type much faster than the speech recognition can handle it, and stopping to correct it just slows things down to a painful level. Of course, I'd probably have to learn to live with it if my wrists started bothering me, but until then...
And then there's this bash.org quote:
http://www.bash.org/?34776
<www666> this is so cool I'm typing with Dragon NaturallySpeaking in mIrc
<www666> no more typing
<LameLLama> www: try "thlash exit"
*** www666 has quit IRC (Leaving)
*** www666 ([email protected]int.ca) has joined #visualbasic
<www666> Hugh Masters
<www666> you basterdes
hans007 - Friday, April 21, 2006 - link
i used speech recognition with office xp when it came out. that was awful.
my acura navigation has speech recognition which is also not, well, that useful. it's still easier to use buttons.
i honestly think it will never be better than just buttons.