Speaking of Speech Recognition... Check out Julius, the OSS high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder
"Julius" is a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers. Based on word N-grams and context-dependent HMMs, it can perform almost-real-time decoding on most current PCs for a 60k-word dictation task. Major search techniques are fully incorporated, such as tree lexicon, N-gram factoring, cross-word context dependency handling, enveloped beam search, Gaussian pruning, and Gaussian selection. Besides search efficiency, it is also carefully modularized to be independent of model structures, and various HMM types are supported, such as shared-state triphones and tied-mixture models, with any number of mixtures, states, or phones. Standard formats are adopted to interoperate with other free modeling toolkits such as HTK, the CMU-Cam SLM toolkit, etc.
The main platforms are Linux and other Unix workstations, and it also works on Windows. The most recent version is developed on Linux and Windows (Cygwin / MinGW), and a Microsoft SAPI version is also available. Julius is distributed under an open license together with its source code.
- An open-source software (see terms and conditions of license)
- Real-time, high-speed, accurate recognition based on a 2-pass strategy.
- Low memory requirement: less than 32MBytes required for work area (<64MBytes for 20k-word dictation with on-memory 3-gram LM).
- Supports LM of N-gram, grammar, and isolated word.
- Language and unit-independent: any LM in ARPA standard format and AM in HTK ASCII hmmdefs format can be used.
- Highly configurable: various search parameters can be set, and alternate decoding algorithms (1-best/word-pair approximation, word trellis/word graph intermediates, etc.) can be chosen.
- Full source code documentation and manual in English / Japanese.
- List of major supported features:
- On-the-fly recognition for microphone and network input
- GMM-based input rejection
- Successive decoding, delimiting input by short pauses
- N-best output
- Word graph output
- Forced alignment on word, phoneme, and state level
- Confidence scoring
- Server mode and control API
- Many search parameters for tuning its performance
- Character code conversion for result output.
- (Rev. 4) Engine becomes Library and offers simple API
- (Rev. 4) Long N-gram support
- (Rev. 4) Run with forward / backward N-gram only
- (Rev. 4) Confusion network output
- (Rev. 4) Arbitrary multi-model decoding in a single thread.
- (Rev. 4) Rapid isolated word recognition
- (Rev. 4) User-defined LM function embedding
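The server mode mentioned above speaks a simple line-based socket protocol ("module mode"). The sketch below is a minimal Python client, written under the assumption that the engine was started with `-module` on the documented default port 10500 and that results arrive as XML-style `WHYPO WORD="..."` lines terminated by a lone `.` line; check your version's protocol documentation, as these details may vary.

```python
import re
import socket


def parse_words(chunk: str) -> list:
    """Pull recognized words out of a module-mode result fragment.

    Julius's module mode emits XML-like lines such as
      <WHYPO WORD="hello" PHONE="h eh l ow" CM="0.85"/>
    (exact attribute set varies by version and configuration).
    """
    return re.findall(r'<WHYPO\s+WORD="([^"]*)"', chunk)


def read_results(host="localhost", port=10500):
    """Connect to a running `julius ... -module` server and yield word lists.

    Port 10500 is the documented default for module mode; the '.' line
    used as a message terminator is an assumption from the module protocol.
    """
    with socket.create_connection((host, port)) as sock:
        buf = ""
        while True:
            data = sock.recv(4096)
            if not data:
                break
            buf += data.decode("utf-8", errors="replace")
            while "\n.\n" in buf:
                msg, buf = buf.split("\n.\n", 1)
                words = parse_words(msg)
                if words:
                    yield words
```

Start the engine with something like `julius -C your.jconf -module`, then iterate over `read_results()` to print hypotheses as they arrive.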
Get Julius for Windows SAPI
Julius for SAPI is the MS Windows version of Julius/Julian that implements the Microsoft(R) Speech API (SAPI) 5.1. You can use this version of Julius as a SAPI Voice Recognizer in applications created for SAPI (e.g. Office XP).
The recent version is fully SAPI-5.1 compliant, and it also supports SALT extension.
Julius for SAPI assumes that the user language and the application's grammar are Japanese, so other languages are a little troublesome because Julius for SAPI does not know the pronunciation of the words in a grammar. If you define a pronunciation for each of these words, it may work, but we have not tried it.
About Models
Since Julius itself is a language-independent decoding program, you can build a recognizer for a language given appropriate language and acoustic models for that language. The recognition accuracy largely depends on the models.
Julius adopts acoustic models in HTK ASCII format, a pronunciation dictionary in a format close to HTK's, and word 3-gram language models in ARPA standard format (a forward 2-gram and a reverse 3-gram trained from the same corpus).
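For reference, a word N-gram in the ARPA standard format looks like the following toy fragment (log10 probabilities in the first column, optional back-off weights in the last); any toolkit that emits this layout, such as the CMU-Cam SLM toolkit mentioned above, should work:

```text
\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0000  <s>     -0.3010
-0.6990  hello   -0.3010
-1.0000  </s>

\2-grams:
-0.3010  <s> hello
-0.4771  hello </s>

\end\
```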
We have already tested English dictation with Julius, and other researchers have reported that Julius also works well in English, Slovenian (see pp. 681--684 of Proc. ICSLP 2002), French, Thai, and many other languages.
Here you can get Japanese and English free language/acoustic models.
- Japanese language model (20k-word trained by newspaper article) and acoustic models (Phonetic tied-mixture triphone / monophone)
- We currently have a sample English acoustic model trained from the WSJ database. According to the license of the database, this model *cannot* be used to develop or test products for commercialization, nor can it be used in any commercial product or for any commercial purpose. Also, its performance is not very good. Please contact us for further information.
- The VoxForge project is working on the creation of an open-source acoustic model for the English language.
If you have any language or acoustic model that can be distributed freely, please contact us. We want to run the dictation kit on languages other than Japanese and share the models freely, to provide a free speech recognition system for various languages.
This is pretty far out there for a guy like me, but I've not seen too many CSRs, and the fact that it can use SAPI was interesting to me. And given all the recent "voice" hype, I thought you might find it interesting, or at least a little different, too...
Related Past Post XRef:
[Reminder] You don't need a Kinect to do Speech Recognition on Windows