Tuesday, April 29, 2008

Update: Recorded Speech to Text Conversion

Every now and then, I run into someone who is doing qualitative research and wants to know about speech-to-text conversion. Here is an update that will answer some of the questions I have been asked during the past year or so in this area. I know some people will be doing interviews this summer, so maybe this will help.
The basic idea is that you have recorded something -- an interview, perhaps -- and now you want to connect your recorder to your computer and have the interview automatically converted into a Word document. If the computer knows when to insert a colon rather than a semicolon, so much the better. Unfortunately, that's still a dream, according to an article published last week by James A. Martin of PCWorld. Then again, a review by Nate Anderson of Ars Technica indicates that Nuance's Dragon NaturallySpeaking, a leader in this area, has made great advances in recent years. Anderson provides an illustration of several paragraphs in which Naturally Speaking actually outperformed his own typing at the keyboard -- provided that he first sat down with a microphone and trained the software to understand his pronunciation. Minimum training requires a few minutes; best results come after several months of use. So the program may do pretty well with your side of the interview -- but less so, most likely, with the words uttered by the interviewee, who will ordinarily be doing most of the talking. (Note: I have seen "Naturally Speaking" spelled both with and without an internal space, so you may want to try both if you're searching for more information. See, e.g., http://tinyurl.com/58238m for suggested search syntax.)
In theory, you could train the software to understand the voice of the interviewee instead -- e.g., by having him/her read a section of text into the recorder, and then teaching Naturally Speaking to understand it later, in the privacy of your own communal workstation. In that case, I think it would be your questions, not his/her answers, that would require repair afterwards (though perhaps you could avoid that by simply cutting and pasting pre-typed versions of your questions into your interview transcript). Naturally Speaking allows for multiple user profiles, so apparently this would require merely adding and training another voice account, without having to delete the previously entered account.
Best results in this area have traditionally come from dictating into a high-quality microphone directly connected to the PC. Now, however, Anderson says that Naturally Speaking does an "acceptable" job even when the recording was made using the internal microphone on a mediocre MP3 player. Martin's point, in his article, was that the Sony ICD-MX20DR9 digital voice recorder (now about $230 plus $25 or so for an additional memory card) comes with a copy of Naturally Speaking (so you don't have to buy a copy separately), was designed for use with Naturally Speaking, and is listed as a compatible model on Nuance's Hardware Compatibility List.
Among the numerous options on Nuance's Hardware Compatibility List (e.g., Headset Microphones (legacy)), there are two for recorders specifically. On the Recorders (legacy) list, only five old models get a three-star rating. On the Recorders (current) list, by contrast, the Sony ICD-MX20 (I assume the DR9 suffix simply means that Dragon Naturally Speaking is included with the hardware) gets six stars -- and is the only recorder that gets more than five stars. I suspect each additional star means, in practice, a somewhat higher percentage of accuracy -- which could translate into many fewer hours spent finding and correcting errors in the text.
So if you were doing your interviews within the next few weeks, one approach would be to shoot first and ask questions later -- i.e., buy the Sony, train the software, experiment with it, and see if it saves you a ton of typing. That might make less sense if you were doing your interviews in a noisy environment, though.
There are some differences among versions of Naturally Speaking (ranging in price from $60 (standard) to $1,200 (legal)). I'm not sure which version comes with the Sony ICD-MX20DR9. In a detailed review of the software (i.e., not the Sony), Cade Metz of PC Magazine says the "results were pretty darn good," and describes the option of correcting errors by voice rather than by keyboard -- which may be available only on more expensive versions (not sure). Elsa Wenzel of ZDNet seconds Metz's view that Naturally Speaking is the best consumer voice recognition program available. As is often the case, however, users' opinions vary dramatically -- possibly as a function of having the right hardware and/or doing the required system training in the recommended manner.
Off the topic of interviews, but of social work relevance: Rita Zeidner of the Washington Post points out Naturally Speaking's productivity implications for persons with disabilities.