Speech Recognition - Ready for Prime Time?
by Jarred Walton on April 21, 2006 9:00 AM EST - Posted in
- Smartphones
- Mobile
Accuracy Testing
To try to keep this article coherent, I decided to cut back on the number of test results reported. I started doing some comparisons of trained versus untrained installations, but untrained installations are really a temporary state, since the software learns as you use it. I have my Dragon installation that I've been using for a while, so that side of the equation is covered. I haven't used Microsoft's speech recognition package nearly as much, but I wanted to make sure I gave it a reasonable chance, so I went through additional training sessions with Office 2003. I also opened several of my articles and had the speech engine learn from their content.
One major advantage of DNS is that it scans your My Documents folder when you first configure it, and as far as I can tell it adds most of the words in your text documents to its recognition engine. Microsoft Office's speech tool can do this as well, but you have to do it manually, one document at a time. I wanted to be fair to both products, but eventually my patience with Microsoft Office 2003 ran out, so it's not as "trained" as DNS 8.
Both Dragon and Microsoft Office let you trade speech recognition speed against accuracy, so I tested performance and accuracy at numerous settings. For Dragon, there are essentially six settings, ranging from minimum accuracy to maximum accuracy. The slider can be adjusted in smaller increments, but clicking in the slider bar jumps between six positions, with each one bringing a moderate change in performance and possibly a change in accuracy.
I tested at all six settings, but I'm only going to report results for the minimum, medium, and maximum accuracy settings in the charts. Dragon also has the ability to transcribe a recording directly from a WAV file at maximum speed, so I'll include a separate chart for that. Microsoft's speech engine also has a linear slider, but I limited testing to minimum accuracy, maximum accuracy, and the middle value. If you would like to see the other test results, the text is available in this Zip file (1 MB).
At the request of some readers, I have also made the MP3 files available for download. (Don't make fun of my voice recordings without making some of your own, though!)
Precise Dictation (5.3 MB)
Natural/Rapid Dictation (4.4 MB)
All of these tests were performed on the X2 system with the "trained" speech profiles. I would like to try to train Microsoft's tool more, but it just doesn't have a very intuitive interface. When you say a word or phrase that DNS doesn't recognize, you simply say "spell that" and provide the correct spelling. In most instances, that will allow DNS to recognize the word(s) in the future. This is particularly useful for names of family/friends/associates/etc. Acronyms can also be trained in this manner, but many acronyms sound similar to other standard words, and they definitely cause recognition difficulties. For example, "Athlon X2" still often comes out as "Athlon axe two" and "SATA" (pronounced, not spelled out) is still recognized as "say to" or "say that".
My experience with Microsoft's speech tool is that it is best used for rough drafts and that you shouldn't worry about correcting errors initially. Once you've got the basic text in place, you can go through and manually edit the errors. That's basically what Microsoft's training wizard tells you as well, so immediately their goals seem less ambitious - and thus their market is also more limited. Luckily, the text being dictated here isn't as complex as that in some of my articles, so Microsoft does pretty well.
Dictation Accuracy
Both packages clearly meet the 90% or higher accuracy claims with practiced dictation. Once you get above 90%, though, every additional accuracy point becomes exponentially more difficult to acquire. With that in mind, the 96% accuracy achieved is impressive. The more specialized your dictation, the higher your chance of errors, but for general language both are capable. Somewhat interesting is that the maximum accuracy settings don't actually improve things in all cases. The lowest accuracy setting usually does the worst, but everything above the Medium setting (the default) seems to get both better and worse - some phrases are corrected, while others suddenly get misinterpreted.
The final thing to consider is that in all cases the computer is able to keep up with the user - though maximum accuracy on DNS barely manages to do so. The sound file being dictated here is 9:21 in length and contains 1181 words. At that rate, the software is handling 126 wpm, which is far faster than most people can type. If you're one of the "hunt and peck" crowd, and you find yourself in a situation where you have to do a lot more typing, you might seriously consider trying speech recognition.
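For anyone curious how accuracy figures like these can be computed, the sketch below shows one common approach: align the recognized text against the reference transcript using a word-level edit distance, count every substitution, insertion, or deletion as an error, and report accuracy along with words per minute. To be clear, this is not the exact procedure behind the charts - those comparisons were done with WinDiff and manual counting, with some leniency for things like hyphenation - and the sample strings and the hard-coded 9:21 / 1181-word figures are only illustrative placeholders.

// Minimal sketch: word-level accuracy (1 - word error rate) and words per minute.
// Assumptions: transcripts are plain text tokenized on whitespace; the sample
// strings stand in for the real dictation files.
#include <algorithm>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

static std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string w; in >> w; ) {
        // Lowercase so "The" versus "the" is not counted as an error.
        std::transform(w.begin(), w.end(), w.begin(), ::tolower);
        words.push_back(w);
    }
    return words;
}

// Levenshtein distance over words = substitutions + insertions + deletions.
static size_t wordEditDistance(const std::vector<std::string>& ref,
                               const std::vector<std::string>& hyp) {
    std::vector<std::vector<size_t>> d(ref.size() + 1,
                                       std::vector<size_t>(hyp.size() + 1));
    for (size_t i = 0; i <= ref.size(); ++i) d[i][0] = i;
    for (size_t j = 0; j <= hyp.size(); ++j) d[0][j] = j;
    for (size_t i = 1; i <= ref.size(); ++i) {
        for (size_t j = 1; j <= hyp.size(); ++j) {
            size_t sub = d[i - 1][j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
            d[i][j] = std::min({ sub, d[i - 1][j] + 1, d[i][j - 1] + 1 });
        }
    }
    return d[ref.size()][hyp.size()];
}

int main() {
    // Hypothetical transcripts standing in for the real test files.
    std::string reference  = "both packages clearly meet the accuracy claims";
    std::string recognized = "both packages clearly met the accuracy claims";

    std::vector<std::string> ref = tokenize(reference);
    std::vector<std::string> hyp = tokenize(recognized);
    size_t errors = wordEditDistance(ref, hyp);
    double accuracy = 100.0 * (1.0 - double(errors) / double(ref.size()));

    // Dictation speed: 1181 words in 9:21 (9.35 minutes) works out to ~126 wpm.
    double minutes = 9.0 + 21.0 / 60.0;
    double wpm = 1181.0 / minutes;

    std::cout << "Errors: " << errors << ", accuracy: " << accuracy << "%\n";
    std::cout << "Dictation rate: " << wpm << " wpm\n";
    return 0;
}

A strict word-level diff like this will report slightly lower accuracy than the manual counts used for the charts, since those deliberately forgave cosmetic differences such as "speech-recognition" versus "speech recognition".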
Transcription Accuracy
Perhaps the fact that the transcription mode doesn't have to deal with commands and real-time interfacing with the user helps improve accuracy. It may also be that reading a WAV file directly, as opposed to hearing it through a microphone, helps accuracy. Regardless, it's clear that the transcription mode offers better accuracy than any of the dictation modes. Measured by errors, transcribing a file is 100% more accurate than dictating it - roughly half as many mistakes.
Realistically, transcription mode is only useful if you plan on dictating into a recording device while you're away from your computer. Otherwise, you simply spend time dictating a recording, have Dragon transcribe it, and then check for errors. The quality of your recording will also play a role, so if you're using a small portable music device with a tiny microphone, or if you're recording in a noisy environment, it's unlikely that you actually get better accuracy rates compared to sitting in front of a computer dictating into a headset.
There's also some question of how good the transcription mode would be at handling something like the minutes of a meeting, where you have numerous voices, accents, males and females, etc. Still, while you may not use the transcribe mode all that often, we would rather have it than not. Microsoft's speech SDK looks like it has the necessary hooks to allow transcription of a WAV file, but at present we were unable to find any utilities that take advantage of this feature.
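For the curious, here is a rough sketch of what such a utility might look like using the SAPI 5 COM interfaces from that SDK: bind the WAV file as the recognizer's input instead of a microphone, activate a dictation grammar, and print recognized phrases as events arrive. Treat it as an untested illustration rather than a finished tool - the "dictation.wav" file name is a placeholder, error checking on the individual COM calls is omitted, and the exact behavior will depend on the SDK version and recognizer installed.

// Rough sketch: feeding a WAV file to the SAPI 5 in-process recognizer
// and printing the recognized text. Assumes the SAPI SDK headers (sapi.h,
// sphelper.h) and ATL; "dictation.wav" is a placeholder file name.
#include <atlbase.h>
#include <sapi.h>
#include <sphelper.h>
#include <cwchar>

int main() {
    if (FAILED(::CoInitialize(NULL))) return 1;
    {
        CComPtr<ISpRecognizer>  cpRecognizer;
        CComPtr<ISpRecoContext> cpContext;
        CComPtr<ISpRecoGrammar> cpGrammar;
        CComPtr<ISpStream>      cpStream;

        // In-process recognizer, fed from the WAV file instead of the mic.
        cpRecognizer.CoCreateInstance(CLSID_SpInprocRecognizer);
        SPBindToFile(L"dictation.wav", SPFM_OPEN_READONLY, &cpStream);
        cpRecognizer->SetInput(cpStream, TRUE);

        // Dictation grammar: free-form text rather than command phrases.
        cpRecognizer->CreateRecoContext(&cpContext);
        cpContext->SetNotifyWin32Event();
        const ULONGLONG interest = SPFEI(SPEI_RECOGNITION) |
                                   SPFEI(SPEI_END_SR_STREAM);
        cpContext->SetInterest(interest, interest);
        cpContext->CreateGrammar(0, &cpGrammar);
        cpGrammar->LoadDictation(NULL, SPLO_STATIC);
        cpGrammar->SetDictationState(SPRS_ACTIVE);

        // Pull recognition events until the end of the audio stream.
        bool done = false;
        while (!done && cpContext->WaitForNotifyEvent(INFINITE) == S_OK) {
            CSpEvent evt;
            while (evt.GetFrom(cpContext) == S_OK) {
                if (evt.eEventId == SPEI_RECOGNITION) {
                    CSpDynamicString text;
                    if (SUCCEEDED(evt.RecoResult()->GetText(
                            SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE,
                            TRUE, &text, NULL))) {
                        wprintf(L"%s ", (WCHAR*)text);
                    }
                } else if (evt.eEventId == SPEI_END_SR_STREAM) {
                    done = true;
                }
            }
        }
    }
    ::CoUninitialize();
    return 0;
}

In other words, the plumbing for file-based transcription appears to be present in the SDK; what seems to be missing is a shipping front-end that exposes it to end users.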
38 Comments
JarredWalton - Sunday, April 23, 2006 - link
Isn't there some comedy routine by an older gentleman that does the whole "verbalize punctuation" shtick? One of the things I might look at in the follow-up article is showing how Dragon does when turning on automatic punctuation. It will attempt to insert periods, commas, and question marks (at least, I think it does question marks) depending on how you speak the text. Obviously, that means you have to be a lot more careful when reading/dictating.
I found it more useful to manually dictate my punctuation, since on frequent occasions I will pause midsentence to try and think what I want to say -- or because of some interruption. Basically, as a writer, punctuation is something that I take pretty seriously. DNS does pretty well with getting it right, but it also makes plenty of mistakes.
Admiral Ackbar - Monday, April 24, 2006 - link
Victor Borge. It's called phonetic punctuation. It was one of the funniest things I have ever seen (I had the privilege of seeing him not long before he died).
Actually though, it could work, and it's quicker than actually saying the word period or question mark.
JarredWalton - Tuesday, April 25, 2006 - link
I bet it takes a hell of a lot of practice, too! Especially if you want to speak at a reasonable clip. I remember laughing my butt off at Victor Borge's routine quite a few years ago. On the bright side, more people might learn how to use proper punctuation!
You also have to worry about the speech recognition software starting to recognize random noises (like a cough) as actual dictation. That happens already, but usually Dragon is smart enough to realize that my cough was merely a loud noise. Sometimes I get the random "the" from it, though.
Tujan - Saturday, April 22, 2006 - link
I would be interested in knowing exactly what the program does - something more focused on its features, interaction, etc., rather than just a comparison between two programs, a sort of benchmark.
For example, you mention command mode, but don't get any further involved with what that encapsulates. That alone has its limitations, I'm sure. Yet I'm also sure that many might want to know exactly what it is about. For example, Start - My Documents - FolderName - Open... and so on. Is this how it works? Or something like the HTPC scenario in which you query your favorite TV show - "Channel - channel name", or "program name - file name - open" for the HTPC.
Everybody should know what a vacuum cleaner can do for you. Ya know. But what can you do for your vacuum cleaner?
I imagine (note: imagine, yes), given speech recognition that was well enough along, you could utilize a command line interface, and programmers would be able to program more quickly and easily. Other than having your vacuum cleaner attack you, ya know, you could do something like "Dir" - listing of directories, or "MD" - make directories.
I don't know any programming code, so I can't give examples beyond the DOS command line. Still, you can see what I'm getting at - program your HTML, for example.
But within the Windows environment, you could ask how well the program takes commands and multitasks, since you could use the wave file to do this, and so on.
I'm just curious. I don't see a lot of interesting software reviews dealing with the nuts and bolts of the application itself lately.
Try a RAM drive with that - take the chains off, maybe?
Ardemus - Friday, April 21, 2006 - link
1) How was the software trained? Were you using "normal" or "dictation" speech patterns?
2) Dragon may do much better with a WAV over a real-time system because it can read ahead and analyze the whole file.
3) Does dragon give up resources when other applications ask for them?
4) What sort of errors were made? How many errors are there after a spell and grammar check in MS Word?
5) Can you correct the errors in each program and scan again, to measure the improvement?
6) I've heard that you can overstress and damage your vocal cords through speech recognition (RSI of the voice). Have you researched that?
7) How often did both packages make the same mistakes? If you ran it through both packages in real-time minimal mode, then DNS at several different speeds, could you run an algorithm on the different results to increase accuracy?
Nick Burger
JarredWalton - Friday, April 21, 2006 - link
1 -- Both were trained in the same manner, basically me speaking the text, but doing my best to enunciate words a little better than I might do in the real world. Besides, good fiction is a useful skill to have, particularly if you're speaking with business people.
2 -- That's entirely possible. One of the odd things is that the accuracy shown in my dictation benchmarks doesn't seem to correspond with my own personal experience of trying to use the software. It may simply be the way that I speak when trying to write articles, but I find that Microsoft is far worse in normal use. That's not a very scientific method, but I can't emphasize enough how much more difficult I find Microsoft's speech interface to use.
3 -- Dragon runs as a normal priority process, and when you're dictating with the accuracy set to "medium" it uses 20 to 50% of the processor time (on a single core Athlon 64 2.4 GHz). The memory footprint is pretty large, at about 150 to 200 MB. As far as I can tell, it will not use more than 200 MB -- during testing, I watched RAM usage on the "maximum accuracy" configuration, because I was curious to see if the switch from 1 GB on my old system to 2 GB on my new system would help. It did not. (the total size of my database/voice files is currently just over 300 MB.)
I also noticed on my old system that Dragon requires a fair amount of hard disk access. I was copying several gigabytes of data from one computer to another computer (over gigabit ethernet) and Dragon's responsiveness dropped way off. It was still accurate, but rather than speaking and seeing the text a second or so later, there was a four or five second pause for most sentences.
4 -- I included a link to a zip file in the article for anyone interested in looking at specific errors. The text files were compared using WinDiff, and I manually counted errors. (I was somewhat lenient, in that I allowed "speech-recognition" to match "speech recognition" -- stuff like that.)
5 -- Dragon has definitely been "trained" on the document. Microsoft seems to do its own thing in terms of training, so all I could do is make sure that all of the words used were known by the speech engine. When you make an error using Microsoft's tool, as far as I know you have to correct with the keyboard. You can't just tell it to select the misinterpreted words and provide the correct interpretation. Perhaps it's possible to switch to command mode, tell the application to select something, then switch to dictation mode and give the correct spelling... at that point, you're far better off using the mouse and keyboard, and if you can't use those then you're much better off using Dragon's interface.
6 -- It's entirely possible, and laryngitis certainly doesn't help speech recognition at all. You definitely don't want to get in the habit of speaking really loudly, so it's best to train the software in a somewhat subdued voice (in my opinion). I would say the most important thing is to do everything in moderation; sitting at a computer dictating for 12 hours a day is going to be just as harmful in the long run as sitting at a computer typing 12 hours a day.
bobsmith1492 - Saturday, April 22, 2006 - link
"Besides, good fiction is a good skill to have when... ":P Kind of like Isaac Asimov?
JarredWalton - Saturday, April 22, 2006 - link
See what I get for not proofing carefully? LOL - that's the type of error I get most of the time. "A" for "the" is another common one.
Gioron - Friday, April 21, 2006 - link
My brother swears by DNS, but using it myself and watching him use it I just can't stand going that slow. I've gotten to the point where I can type much faster than the speech recognition can handle, and stopping to correct it just slows things down to a painful level. Of course, I'd probably have to learn to live with it if my wrists started bothering me, but until then...
And then there's this bash.org quote:
http://www.bash.org/?34776
<www666> this is so cool I'm typing with Dragon NaturallySpeaking in mIrc
<www666> no more typing
<LameLLama> www: try "thlash exit"
*** www666 has quit IRC (Leaving)
*** www666 ([email protected]int.ca) has joined #visualbasic
<www666> Hugh Masters
<www666> you basterdes
hans007 - Friday, April 21, 2006 - link
I used speech recognition with Office XP when it came out. That was awful.
My Acura navigation has speech recognition, which is also not, well, that useful; it's still easier to use buttons.
I honestly think it will never be better than just buttons.