I've always found it entertaining when technology attempts to interact with humans in a non-trivial fashion. While evaluating new phone systems for a former employer I had the opportunity to experiment with some new phone lines on our VoIP phones. One of the things I worked on was having the phone system 'attempt to tell a joke'.
The idea came from my experience with eMac systems back at University, which could tell a joke at random. They did this via a speech recognition system with a limited vocabulary (a fixed group of phrases, which reduces the search space; it's a technique I've commonly seen in OEM speech control systems such as the one that comes standard with Toshiba laptops). It also relied on an XML-based file that contained the jokes themselves. A randomised algorithm then selected which joke to tell and what type of joke it would be (Knock knock, Why did the chicken cross the road, etc.). While it was not quite as entertaining as a real human it was at least amusing, and it gave me the idea for my little experiment.
For my version, random jokes were pulled from the web using a web scraper (or else downloaded manually) and then reformatted into a flat text file, one joke per line, as follows.
'line number' 'tab character' 'joke'
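A minimal sketch of reading that format and picking a joke at random, assuming one joke per line and a single tab after the line number. The function names are hypothetical and not part of the original setup.

```python
import random


def parse_jokes(text):
    """Parse lines of the form '<line number><TAB><joke>' into a dict."""
    jokes = {}
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        number, _, joke = raw.partition("\t")
        # int() raises ValueError on a malformed line number
        jokes[int(number)] = joke.strip()
    return jokes


def pick_joke(jokes, rng=random):
    """Return a (line number, joke) pair chosen uniformly at random."""
    number = rng.choice(sorted(jokes))
    return number, jokes[number]
```

Reading the file itself is then just a matter of passing its contents to parse_jokes, for example parse_jokes(open(path).read()).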
Thereafter, when and if required, the string was encoded to 'wav' and/or another sound file format. A random number was chosen every once in a while to determine which particular joke to encode as a sound file. Then, when you call the 'Joke line', a script is called to determine which file to play, with the audio produced by 'espeak', 'festival', or any other speech synthesis software.
Obviously, I tried playing around with speech recognition too, but using the phone as a microphone on a VoIP-based phone system, on a network with 'jumpy' traffic, makes things a bit difficult. Maybe when I find myself on a network with more manageable traffic I'll continue this line of research.
Some of the results were very interesting. If I remember correctly, my experiments suggested a load ratio of about 30 people to a single server (Dual Xeon 2.8/4GB/10K SAS) before performance dropped drastically, which matters if you are thinking about completely automated phone-based interviewing. That was a tangent I was considering while working on web-based interviewing technology (auto-generated code which worked around existing survey scripting languages and was backwards compatible with SPSS Quancept scripts). Below are some of the notes from my research.
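The encoding step could be sketched roughly as follows. This is only an illustration, assuming the joke has already been written to a text file; the function names are hypothetical, while the flags used are espeak's -f (read text from a file) and -w (write a WAV file) and text2wave's -o (output file).

```python
import shutil
import subprocess


def synth_command(engine, text_file, wav_file):
    """Build the command line that renders text_file to wav_file."""
    if engine == "espeak":
        # espeak: -f reads the text from a file, -w writes a WAV file
        return ["espeak", "-f", text_file, "-w", wav_file]
    if engine == "festival":
        # text2wave ships with festival; -o names the output file
        return ["text2wave", text_file, "-o", wav_file]
    raise ValueError(f"unsupported engine: {engine}")


def encode_joke(engine, text_file, wav_file):
    """Run the synthesiser; return False if it is not installed."""
    cmd = synth_command(engine, text_file, wav_file)
    if shutil.which(cmd[0]) is None:
        return False  # engine not found on this machine
    subprocess.run(cmd, check=True)  # raises if the engine fails
    return True
```

For example, synth_command('espeak', 'joke.txt', 'joke.wav') gives the argument list for 'espeak -f joke.txt -w joke.wav'.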
- http://www.voip-info.org/wiki/view/Asterisk+cmd+Festival, text2wave is basically a wrapper script for festival. Festival itself is scripted in Scheme, a LISP dialect
- init.scm and .festivalrc are two config files that are read at initialisation
- Utterance structure, http://www.cstr.ed.ac.uk/projects/festival/manual/festival_14.html
- english.wav is Audio, araw, Mono, 22050Hz, 16 bit
- jokes-clean.wav is Audio, araw, Mono, 8000Hz, 16 bit
- man sox; converting to 8000Hz, mono, signed 16 bit:
sox foo-in.wav -r 8000 -c 1 -s -w foo-out.wav resample -ql
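The conversion and a quick check of the result can be sketched like this. It assumes a recent SoX release, where, as far as I know, the old -s and -w switches are spelled -e signed-integer and -b 16 and resampling happens automatically when -r differs from the input rate. The function names are hypothetical.

```python
import wave


def sox_to_telephony(src, dst, rate=8000):
    """Build a SoX command for mono, signed 16-bit output at `rate` Hz."""
    # Output format options go between the input and output file names.
    return ["sox", src, "-r", str(rate), "-c", "1",
            "-e", "signed-integer", "-b", "16", dst]


def wav_properties(path):
    """Return (channels, sample rate in Hz, bits per sample) of a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getframerate(), wav.getsampwidth() * 8
```

Running wav_properties on the converted file is a cheap way to confirm it really is mono, 8000Hz, 16 bit before handing it to the phone system.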
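; ######################################
; English Accent
; ######################################
; Dial *777: run the script, then play the sound file it presumably generates
exten => *777,1,Answer()
exten => *777,2,Wait(1)
exten => *777,3,NoOp()
exten => *777,4,System(/usr/bin/english)
; Playback takes the file name without its extension
exten => *777,5,Playback(/tmp/english-out)
exten => *777,6,Hangup()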
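The 'English Accent' header suggests there may be similar blocks for other voices. A small sketch of generating equivalent dialplan lines from a template is below. The function name is hypothetical, and any extension, script or sound file other than the *777 ones above would be an assumption.

```python
def joke_line_dialplan(extension, script, sound_base):
    """Return Asterisk dialplan lines for one joke line."""
    steps = [
        "Answer()",
        "Wait(1)",
        "NoOp()",
        f"System({script})",          # generate the sound file
        f"Playback({sound_base})",    # file name without extension
        "Hangup()",
    ]
    # Priorities are numbered from 1 in the order the steps run.
    return [f"exten => {extension},{priority},{app}"
            for priority, app in enumerate(steps, start=1)]
```

Printing the result of joke_line_dialplan('*777', '/usr/bin/english', '/tmp/english-out') line by line reproduces the six steps of the block above.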
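- from the sox man page: 'sox file.au file.wav' translates a sound file in SUN Sparc .AU format into a Microsoft .WAV file, while 'sox -v 0.5 file.au -r 12000 file.wav mask' does the same translation at half the volume and a 12000Hz sample rate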
- notes to self, 8000Hz is completely incomprehensible, 22000Hz is a much more realistic sample rate
- as usual thanks to all of the individuals and groups who purchase and use my goods and services
http://sites.google.com/site/dtbnguyen/
http://dtbnguyen.blogspot.com.au/