Introduction

The initial goal was to achieve a good word error rate using open-source or free software. After trying some of the existing options, one stood out with impressively low WER values: Kaldi. After some engineering, I found that Kaldi with the ASpIRE model works quite well out of the box for generic English speech recognition; however, it missed almost all the technical words in the recordings I gave it. This did not come as a surprise, as the ASpIRE model is one for generic conversational English: it was never trained on domain-specific words. So I wanted to extend its vocabulary and known grammar.

Let’s talk about speech recognition in general. In a simplified view, speech recognition engines process incoming speech and convert it to phones. The phones are then mapped to words, and finally the words are aligned together to form sentences. The models used during this process usually have three logical parts:

• Acoustic model: used during acoustic feature extraction. The input is the raw audio and the outputs are acoustic features, phonemes or tri-phones: something that can be connected to the atomic parts of speech and is useful for linking them together into logical groups, practically words.
• Dictionary: sometimes also called the vocabulary or lexicon. The list of words in the language that the system can decode.
• Language model: or grammar; defines how words can be connected to each other. In some cases it can be defined by a set of rules, e.g. in a helpdesk menu navigation system or when you want to understand simple commands. In more generic cases, however, such as general speech transcription, a statistical approach is more often used. In the latter case, the language model is a large list of word tuples (n-grams) with assigned probabilities.
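
To make the statistical approach concrete, here is a toy sketch (not from the original post; the corpus and function name are made up for illustration) that estimates bigram probabilities from raw counts, which is the idea behind the n-gram lists mentioned above:

```python
from collections import Counter

def bigram_probs(sentences):
    """Estimate P(word | previous word) from raw bigram counts (no smoothing)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

probs = bigram_probs(["the cat sat", "the cat ran", "the dog sat"])
print(probs[("the", "cat")])  # 2 of the 3 occurrences of "the" are followed by "cat"
```

Real toolkits add smoothing and back-off so that unseen word tuples still get a small probability, but the counting principle is the same.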

To extend the model with new words, you extend the dictionary and the language model. Luckily, we don’t have to modify the acoustic model, since the new words are most probably made up of the same phones that are used in everyday conversation. The system needs to know which new words or word tuples have what probability of occurrence, so it can predict them; it will not predict something that does not exist in its corpus.

The following technical tutorial will guide you through booting up the base Kaldi with the ASpIRE model, and extending its language model and dictionary with new words or sentences of your choosing.

Note: this tutorial assumes you are using Ubuntu 16.04 LTS.

You can skip this if you have a working Kaldi environment already.

Clone the Kaldi git repository; I’ve used commit “e227eda38339a2f7f0cbbc73b166823c13fdde25”.

git clone https://github.com/kaldi-asr/kaldi


Compile the Kaldi components according to the INSTALL file. Normally this should be easy.

The “egs” folder contains example models and scripts for Kaldi and the model data for them is available separately because of the large file sizes. We will uncompress the chain model data into the appropriate folder for the ASpIRE recipe.

cd kaldi/egs/aspire/s5
wget http://dl.kaldi-asr.org/models/0001_aspire_chain_model.tar.gz
tar xfv 0001_aspire_chain_model.tar.gz


To prepare the model for use, check the README.TXT; I’ll list the preparation commands below:

steps/online/nnet3/prepare_online_decoding.sh --mfcc-config conf/mfcc_hires.conf data/lang_chain exp/nnet3/extractor exp/chain/tdnn_7b exp/tdnn_7b_chain_online


Note: this one will take some time to finish. Be patient.

utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test exp/tdnn_7b_chain_online exp/tdnn_7b_chain_online/graph_pp


At this point we can test whether the (yet untouched) model works. Pick your favorite audio recording and convert it to a 16-bit, 8 kHz, mono waveform. This is a must for the decoding to work properly, since the original model was also trained on audio files in this format. If you’re not sure how to convert to the appropriate format, use ffmpeg with the following snippet:

ffmpeg -i <input-file> -acodec pcm_s16le -ac 1 -ar 8000 <output.wav>


These two scripts help set up path and environment variables for Kaldi decoding; just run them once before working with the ASpIRE model:

. cmd.sh
. path.sh


Finally, run the following command (transcribe your 8khz waveform):

online2-wav-nnet3-latgen-faster \
--online=false \
--do-endpointing=false \
--frame-subsampling-factor=3 \
--config=exp/tdnn_7b_chain_online/conf/online.conf \
--max-active=7000 \
--beam=15.0 \
--lattice-beam=6.0 \
--acoustic-scale=1.0 \
--word-symbol-table=exp/tdnn_7b_chain_online/graph_pp/words.txt \
exp/tdnn_7b_chain_online/final.mdl \
exp/tdnn_7b_chain_online/graph_pp/HCLG.fst \
'ark:echo utterance-id1 utterance-id1|' \
'scp:echo utterance-id1 your-audio.wav |' \
'ark:/dev/null'


Don’t skip this step: make sure you’ve got this running before proceeding to the next one.
Ideally you should see something like this among the log entries:

utterance-id1 <the transcription of your audio file>

If you have this, you’re good. Let’s check just what some of the parameters mean.

online2-wav-nnet3-latgen-faster --> the decoding utility that can be used with this model; it resides in the kaldi/src/online2bin folder
--online=false --> set online decoding to false. Kaldi supports online decoding, which means that the transcription starts before the audio file is read completely. We turn this off for maximum accuracy.
--config=exp/tdnn_7b_chain_online/conf/online.conf --> the path of the configuration to use during decoding
--beam=15.0 --> the value for beam pruning. In a nutshell, increasing this value results in better accuracy but slower speed, while decreasing it makes decoding faster but less accurate.
--word-symbol-table=exp/tdnn_7b_chain_online/graph_pp/words.txt --> as Kaldi uses symbols (numbers) to identify words internally, this file allows the decoder to map the symbols back into human-readable words at the end of processing
exp/tdnn_7b_chain_online/final.mdl --> the path to the acoustic model
exp/tdnn_7b_chain_online/graph_pp/HCLG.fst --> the path to the HCLG WFST graph
‘ark:echo utterance-id1 utterance-id1|’ --> pairs the speaker ID with the utterance ID
‘scp:echo utterance-id1 your-audio.wav |’ --> this contains the path to your wave file (replace your-audio.wav accordingly)
‘ark:/dev/null’ --> the lattice output; we discard it here

Generating and merging your own corpus

In Kaldi, the HCLG.fst contains (aside from other things) the dictionary and the language model. Without going into much detail, Kaldi represents these in the form of a large weighted finite state transducer (if you want to know more, the Kaldi homepage is a good place to start). When “compiling” the dictionary and grammar into the HCLG.fst file, many optimizations are conducted, so changing the .fst file directly is out of the question. What we can do however, is to change the source files and recompile them into our own HCLG.fst.

Let’s see where these are located:

• The dictionary resides in the data/local/dict directory
• The language model is in the data/local/lm/3gram-mincount directory (in compressed format)

We will add new entries to these files, but first we must convert them to an appropriate format.

The tutorial will assume that you have a “corpus.txt” file containing sentences (and therefore words) that you want to include in the corpus for prediction. The format should be: one sentence per line. It can be one word per line, or sentences from a book, technical document, etc.
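
If your source text is not yet one sentence per line, a rough normalization step might look like the following sketch (a hypothetical helper, not part of the original tutorial; the punctuation-based splitting is deliberately naive and real corpora may need smarter segmentation):

```python
import re

def to_corpus_lines(raw_text):
    """Split raw text into one sentence per line, as expected in corpus.txt."""
    # Naive split on sentence-ending punctuation; drop empty fragments.
    sentences = re.split(r"[.!?]+", raw_text)
    return [s.strip() for s in sentences if s.strip()]

lines = to_corpus_lines("Kaldi is great. It decodes speech! Does it scale?")
print("\n".join(lines))
```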

It is best to create a separate directory in “kaldi/egs/aspire/s5” for your own files, I will simply assume the use of the “new” directory in this tutorial.

Option A – Using CMU’s knowledge base tool

If you don’t mind sharing your corpus with a third party, CMU offers the Sphinx Knowledge Base Tool, where you can upload your corpus file and get back a computed language model and dictionary that can be used readily. The tool, however, has some limitations on the size of the input file.

The link for the tool http://www.speech.cs.cmu.edu/tools/lmtool-new.html

You are going to need the generated .dic and .lm files to proceed.

Option B – the more fun, DIY way

Creating the dictionary

The dictionary contains words and their phonetic forms, so as a first step we need to create phonetic readings for the new words.

First, we need to extract every unique word from the corpus; for that we can use this one-liner:

grep -oE "[A-Za-z'-]{3,}" corpus.txt | tr '[:lower:]' '[:upper:]' | sort | uniq > words.txt


The regular expression matches every word that consists of English letters, apostrophes or dashes and is at least 3 characters long. The second command converts the words to uppercase, the third sorts them, and finally only unique entries are left at the end of the pipeline. We write the results to words.txt.
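
If you prefer to double-check what the pipeline produces, the same extraction can be mirrored in Python (a sketch for verification only; `extract_words` is a made-up helper, not part of the tutorial):

```python
import re

def extract_words(text):
    """Mirror the grep/tr/sort/uniq pipeline: unique uppercase words of
    3+ letters, apostrophes or dashes, sorted alphabetically."""
    matches = re.findall(r"[A-Za-z'-]{3,}", text)
    return sorted(set(m.upper() for m in matches))

print(extract_words("the Kaldi toolkit's tri-phone model, the model"))
# ['KALDI', 'MODEL', 'THE', "TOOLKIT'S", 'TRI-PHONE']
```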

Sequitur G2P is an application that can be trained to generate readings for new words. We will use CMU’s dictionary, which is included in the ASpIRE model’s codebase, as a basis to train a pronunciation model. Run the following commands to train a model; note that it will take quite a while to finish (probably several hours). Alternatively, you may download the model I’ve generated here and use that.

Build and install Sequitur G2P according to the instructions: https://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html .

After installing, run the following script to train a model (in case you don’t want to use the one provided).

As these will take quite a while, best leave it on for the weekend.

Run from ‘kaldi/egs/aspire/s5/data/local/dict/cmudict/sphinxdict’:

g2p.py --train cmudict_SPHINX_40 --devel 5% --write-model model-1
g2p.py --model model-1 --test cmudict_SPHINX_40 > model-1-test

g2p.py --model model-1 --ramp-up --train cmudict_SPHINX_40 --devel 5% --write-model model-2
g2p.py --model model-2 --test cmudict_SPHINX_40 > model-2-test

g2p.py --model model-2 --ramp-up --train cmudict_SPHINX_40 --devel 5% --write-model model-3
g2p.py --model model-3 --test cmudict_SPHINX_40 > model-3-test

g2p.py --model model-3 --ramp-up --train cmudict_SPHINX_40 --devel 5% --write-model model-4
g2p.py --model model-4 --test cmudict_SPHINX_40 > model-4-test

g2p.py --model model-4 --ramp-up --train cmudict_SPHINX_40 --devel 5% --write-model model-5
g2p.py --model model-5 --test cmudict_SPHINX_40 > model-5-test

g2p.py --model model-5 --ramp-up --train cmudict_SPHINX_40 --devel 5% --write-model model-6
g2p.py --model model-6 --test cmudict_SPHINX_40 > model-6-test

g2p.py --model model-6 --ramp-up --train cmudict_SPHINX_40 --devel 5% --write-model model-7
g2p.py --model model-7 --test cmudict_SPHINX_40 > model-7-test


You can check the error rate at the end of each ‘*-test’ file. Model #7 should already give you relatively low error values; you may continue this training procedure until the error rate is satisfactory. Finally, apply the trained model to your word list to generate the dictionary:

g2p.py --model model-7 --apply words.txt > words.dic


Creating the language model

After creating your dictionary, it is also possible to build a language model using a tool called SRILM. It can be obtained through this website: http://www.speech.sri.com/projects/srilm/download.html . I have used version 1.7.2 in the tutorial. After setting it up, run the following scripts.

Convert the corpus to uppercase, because the dictionary is in uppercase format

cat corpus.txt | tr '[:lower:]' '[:upper:]' > corpus_upper.txt


Generate a language model from the corpus

ngram-count -text corpus_upper.txt -order 3 -limit-vocab -vocab words.txt -unk -map-unk "<unk>" -kndiscount -interpolate -lm lm.arpa
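
The resulting lm.arpa is a plain-text file that starts with a \data\ header declaring how many n-grams of each order it contains. As a quick sanity check, a sketch like the following (hypothetical helper, illustrative counts) can read those declarations:

```python
def arpa_ngram_counts(lines):
    """Read the 'ngram N=count' declarations from an ARPA file header."""
    counts = {}
    for line in lines:
        line = line.strip()
        if line.startswith("ngram ") and "=" in line:
            order, count = line[len("ngram "):].split("=")
            counts[int(order)] = int(count)
        elif line.startswith("\\1-grams:"):
            break  # header ends where the first n-gram section begins
    return counts

header = ["\\data\\", "ngram 1=42", "ngram 2=128", "ngram 3=97", "", "\\1-grams:"]
print(arpa_ngram_counts(header))  # {1: 42, 2: 128, 3: 97}
```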


Merging the input files

Now that we have our own dictionary and language model files, it is time to merge them with the ones used in the original model. I’ve done this “manually” with a small Python script I wrote:

import io
import sys
import collections

if len(sys.argv) != 7:
    print(sys.argv[0] + " <base.dic> <base.lm> <ext.dic> <ext.lm> <out.dic> <out.lm>")
    sys.exit(2)

[basedic, baselm, extdic, extlm, outdic, outlm] = sys.argv[1:7]

print("Merging dictionaries...")

words = collections.OrderedDict()
for dic in [basedic, extdic]:
    with io.open(dic, 'r') as Fdic:
        for line in Fdic:
            arr = line.strip().replace("\t", " ").split(" ", 1)  # Sometimes tabs are used
            if len(arr) < 2:
                continue  # skip entries without an associated reading
            [word, pronunciation] = arr
            word = word.lower()
            if word not in words:
                words[word] = set([pronunciation.lower()])
            else:
                words[word].add(pronunciation.lower())

with io.open(outdic, 'w', newline='\n') as Foutdic:
    for word in words:
        for pronunciation in words[word]:
            Foutdic.write(word + " " + pronunciation + "\n")

print("Merging language models...")

# Read LM entries - works only on 3-gram models at most
grams = [[], [], []]
for lm in [baselm, extlm]:
    with io.open(lm, 'r') as Flm:
        mode = 0
        for line in Flm:
            line = line.strip()
            if line == "\\1-grams:": mode = 1
            if line == "\\2-grams:": mode = 2
            if line == "\\3-grams:": mode = 3
            arr = line.split()
            if mode > 0 and len(arr) > 1:
                grams[mode - 1].append(line.lower())

with io.open(outlm, 'w', newline='\n') as Foutlm:
    Foutlm.write(
        u"\\data\\\n" +
        u"ngram 1=" + str(len(grams[0])) + "\n" +
        u"ngram 2=" + str(len(grams[1])) + "\n" +
        u"ngram 3=" + str(len(grams[2])) + "\n"
    )
    for i in range(3):
        Foutlm.write(u"\n\\" + str(i + 1) + u"-grams:\n")
        for g in grams[i]:
            Foutlm.write(g + "\n")

    Foutlm.write(u"\n\\end\\\n")


You might have to unzip the language model first:

gunzip -k data/local/lm/3gram-mincount/lm_unpruned.gz

The next script assumes you are in the 'new' directory. If you have used the CMU web service, you will need the .dic and the .lm files.

python mergedicts.py ../data/local/dict/lexicon4_extra.txt ../data/local/lm/3gram-mincount/lm_unpruned words.dic lm.arpa merged-lexicon.txt merged-lm.arpa

Note: all the script does is a simple “merge” of the model entries. It does not take into account how the statistical probabilities of the original words change when the new domain-specific words are added. To do a perfect merge, we would need the original training text, which is not available for free. Still, even using this simple method, I could get quite good results recognizing new words and sentences.
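
To see the caveat numerically: ARPA files store probabilities as base-10 logarithms, and simply appending new entries breaks the normalization of each n-gram order. A toy sketch with made-up numbers:

```python
# Hypothetical base-10 log probabilities, as they would appear in an ARPA file.
base_unigrams = [-0.30103, -0.30103]   # two words at probability 0.5 each: sums to 1
new_unigrams = [-0.69897]              # one appended word with probability 0.2

def prob_mass(logprobs):
    """Total probability mass represented by a list of log10 probabilities."""
    return sum(10 ** lp for lp in logprobs)

print(round(prob_mass(base_unigrams), 3))                 # 1.0
print(round(prob_mass(base_unigrams + new_unigrams), 3))  # 1.2 -- no longer normalized
```

A proper merge would rescale all entries (e.g. by interpolating the two models), which is exactly what requires the original training data or a real LM toolkit.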

Recompile the HCLG.fst

Now we have all the input files we need to re-compile the HCLG.fst using our new lexicon and grammar files.

Make sure you are in the ‘kaldi/egs/aspire/s5’ folder.

Let’s assume that:

• you have created a ‘new/local/dict’ folder and copied over the following files from ‘data/local/dict’:
• extra_questions.txt
• nonsilence_phones.txt
• optional_silence.txt
• silence_phones.txt
• you have copied your merged dictionary file into this folder as ‘lexicon.txt’
• you have created a new/local/lang folder with the merged language model in it called ‘lm.arpa’

# Set up the environment variables (again)
. cmd.sh
. path.sh

# Set the paths of our input files into variables
model=exp/tdnn_7b_chain_online
phones_src=exp/tdnn_7b_chain_online/phones.txt
dict_src=new/local/dict
lm_src=new/local/lang/lm.arpa

lang=new/lang
dict=new/dict
dict_tmp=new/dict_tmp
graph=new/graph

# Compile the word lexicon (L.fst)
utils/prepare_lang.sh --phone-symbol-table $phones_src $dict_src "<unk>" $dict_tmp $dict

# Compile the grammar/language model (G.fst)
gzip $lm_src
utils/format_lm.sh $dict $lm_src.gz $dict_src/lexicon.txt $lang

# Finally assemble the HCLG graph
utils/mkgraph.sh --self-loop-scale 1.0 $lang $model $graph

# To use our newly created model, we must also build a decoding configuration, the following line will create these for us into the new/conf directory
steps/online/nnet3/prepare_online_decoding.sh --mfcc-config conf/mfcc_hires.conf $dict exp/nnet3/extractor exp/chain/tdnn_7b new  Let’s test our newly made model now (the configuration, word symbol-table and the HCLG graph paths are updated): online2-wav-nnet3-latgen-faster \ --online=false \ --do-endpointing=false \ --frame-subsampling-factor=3 \ --config=new/conf/online.conf \ --max-active=7000 \ --beam=15.0 \ --lattice-beam=6.0 \ --acoustic-scale=1.0 \ --word-symbol-table=new/graph/words.txt \ exp/tdnn_7b_chain_online/final.mdl \ new/graph/HCLG.fst \ 'ark:echo utterance-id1 utterance-id1|' \ 'scp:echo utterance-id1 |' \ 'ark:/dev/null'  So there you have it! You may now use it to decode your audio files, or write a web-server around it and serve it as a transcription REST interface, even write a HTML5 & JavaScript backed frontend to transcribe stuff you say to your microphone instantly. It’s up to you how you use this amazing tool that is available for you thanks to the people developing Kaldi and the enormous research they’ve put behind it. Links and more background materials: 50 thoughts on “Kaldi ASR: Extending the ASpIRE model” 1. frizi says: Great work ! Do you know if it is possible to learn another language using this method (for exemple by only keeping the new vocabulary)?If it is not, is there a way to modify the pre trained acoustic model in order to do it ? Like 1. Thanks. Unfortunately I think it is not possible since another language would require a different acoustic model. Modifying an existing acoustic model is something I did not get into, so I cannot say much about it, but I believe it is a much more complex task than modifying the language model or dictionary entries. However if you can obtain a complete chain (including acoustic model) for that other language, you may be able to modify the words in it, in a similar way as in this blog entry. Liked by 1 person 2. grib says: Hey ! 
Would you know how to write the result of the recognition in a txt file rather than only on the shell? Like 1. First you need to change the last line “ark:/dev/null” to something like “ark,t:output.lat”. This way you can get the lattices in a text format. You can then select the best available lattice: lattice-best-path –acoustic-scale=0.1 “ark,t:output.lat” “ark,t:output.txt” output.txt contains the best matching predicted text, but in dictionary ID numbers. You can finally get a human readable form by converting them to word symbols using the dictionary: %kaldi-path%/egs/wsj/s5/utils/int2sym.pl -f 2- %path-to-you-dictionary%/words.txt < output.txt Alternatively you can modify the "online2-wav-nnet3-latgen-faster.cc" & recompile 🙂 since it kind of does the same post-processing, but AFAIR you cannot make it write the final prediction to a file directly, just the lattices. Liked by 1 person 2. PREPAID CUSTOMER says: I wanted the same thing. I saved the output to a file with > /tmp/output.txt, then worked up this gigantic sed that removes everything except the phrase: sed ‘/^utterance-id1/!d;s/utterance-id1//g;s///g;s/$noise$//g;s/^[ ]*//g’ /tmp/output.txt 🙂 Liked by 1 person 1. Hafiz Farhan Ahmad says: Prepaid Customer, How you did that? Can you please write the complete commands and procedure. I am failing to do that! I am getting results on the console. Note: I am decoding my wav file with pretrained model. Like 3. davo says: Hi ! Do you know which file in the Kaldi recipe they used in order to generate the tdnn_7b_chain_online folder? ( indeed I find the files used for the nnet3 and chain folders but not this one). And furthermore, why dis you use the tdnn_7b_chain_online model ? Is it better than the models mentioned previously ? Thank you ! Davo Like 1. Hi! When I’ve write the blog entry I believe that was the most up-to-date model (not sure if it still is). 
I’m not sure about the recipe, I think you should ask in the official Kaldi forums: http://kaldi-asr.org/forums.html Hope this helps Like 4. ash says: Hi I want to rescore the N-best list of existing model after interpolating old LM with my new LM, Do I have to recompile HCLG.fst file? thanks Like 1. Hi, I’m afraid I’m not sure about the problem, could you elaborate a bit on what you want to achieve? Like 1. ash says: Hi – I want to improve WER of existing model by adapting the background LM ( used in when in HCLG.fst) with my in domain LM via interpolation, Do I have to create G.fst for the interpolated LM and recompile HCLG.fst? , or just do lattice rescoring with the new LM without changing the HCLG.fst. thanks Like 5. RiKi says: This is a great post. Thank you Krisztián. I am trying to test the aspire model using an application wrapper around kaldi i.e i am not passing the parameters via command line but passing them through code. I have run into issues doing so most probably due to my lack of understanding of the following commands: ‘ark:echo utterance-id1 utterance-id1|’ \ ‘scp:echo utterance-id1 |’ \ ‘ark:/dev/null’ Could you please elaborate a little bit on what they do and how I can change it so that i can code that command into my application? Thank you. Like 1. Thanks Riki. Check out Lecture 1, Page 44 here: https://sites.google.com/site/dpovey/kaldi-lectures . These are input specifiers for parameters. The first one pairs the speaker ID with the utterance ID, the second one specifies which sound file belongs to which utterance (check out how I used it) – and the third one is the lattice output, if you’d like them written to a file for further processing (but if you are recompiling anyways, it is easier to post-process the lattices in the code rather than writing to a file). Like 1. RiKi says: Thank you. Like 6. riebling27 says: I also noticed that missing folder. 
(but I didn’t run all the preparation scripts first) I’m wondering if this would explain why I’m also missing conf/online.conf which is not available in the models download. Like 1. I think you can use “steps/online/nnet3/prepare_online_decoding.sh” to generate the configuration files. Like 7. Praveen Yadav says: Hi, When I execute – utils/mkgraph.sh –self-loop-scale 1.0 data/lang_pp_test exp/tdnn_7b_chain_online exp/tdnn_7b_chain_online/graph_pp I get : utils/mkgraph.sh: line 75: tree-info: command not found Error when getting context-width Not sure if we need to build/compile before. Any idea ? Thanks Like 8. John AI says: phones.txt in the exp folder is lowercase, while the ones in the dictionary you get from either CMU or g2p is uppercase also there is no unk but eps Like 9. Hey thank you very much its excellent! I want to know is it possible to make language model in this way to limit recognition only to our corpus text if yes can you tell me more please? Like 1. Hi! Yes it should be possible, you just need to skip the “Merging the input files” part and use your newly generated dictionary and language models directly when compiling the HCLG.fst. This way the output should be limited to the models generated from your corpus only. Liked by 1 person 1. Ehsan Safdari says: now every thing is excellent with it 🙂 thank you very much . Could you please give me a link or guide to how use it continuous listening server like websocket or locally stream microphone signal to it ? Like 2. No problem. In my proof of concept I only went as far as record audio from a browser and upload the waveform to a backend for processing, and then respond with the recognized text. So it was semi-online, where a “transcribing” phase followed the “recording” phase. Recording audio is from browser is not that big challenge as there are many good tutorials online. 
Next step would probably be to build a webservice that supports websocket or some other connection-oriented protocol to stream the waveform, and channel it to a well-parameterized kaldi executable (you probably also need to set –online=true in the parameters) and stream back the text you get from it. Like 10. Peter says: I got up to the ‘make graph’ part, and after about 2 hrs got an error message….. ========= utils/mkgraph.sh –self-loop-scale 1.0 data/lang_pp_test exp/tdnn_7b_chain_online exp/tdnn_7b_chain_online/graph_pp tree-info exp/tdnn_7b_chain_online/tree tree-info exp/tdnn_7b_chain_online/tree fstpushspecial fstdeterminizestar –use-log=true fstminimizeencoded fsttablecompose data/lang_pp_test/L_disambig.fst data/lang_pp_test/G.fst utils/mkgraph.sh: line 92: 7689 Done fsttablecompose$lang/L_disambig.fst $lang/G.fst 7690 Killed | fstdeterminizestar –use-log=true 7691 | fstminimizeencoded 7692 | fstpushspecial 7693 | fstarcsort –sort_type=ilabel >$lang/tmp/LG.fst.

================

Do you know what is causing it please ?


1. Hmm, not sure. It has been a while since I wrote this tutorial, the base code might have changed since then. You could check if you’re using the same revision as I did back then.


2. PREPAID CUSTOMER says:

I think you ran out of memory. I had similar issues. After making an 8GB (!) swap file, I was able to complete.


11. Sonal says:

Hey, can we decode MFCCs instead of audio files using ‘online2-wav-nnet3-latgen-faster’?


1. yizhak says:

Hi Krisztián, Thank you for your great post!

Do you know how to get the start and end times for each word?
what parameters should I specify?


1. yizhak says:

OK, I’ve found a solution to my question, a command for example is (thanks to Nickolay Shmyrev for his help):

online2-wav-nnet3-latgen-faster --frame-subsampling-factor=3 --config=exp/tdnn_7b_chain_online/conf/online.conf --acoustic-scale=1.0 --word-symbol-table=exp/tdnn_7b_chain_online/graph_pp/words.txt exp/tdnn_7b_chain_online/final.mdl exp/tdnn_7b_chain_online/graph_pp/HCLG.fst 'ark:echo utterance-id1 utterance-id1|' 'scp:echo utterance-id1 /tmp/sample_1_8khz.wav|' ark:- | lattice-to-ctm-conf ark:- - | int2sym.pl -f 5 exp/tdnn_7b_chain_online/graph_pp/words.txt

The above command’s long running time can be shortened by adding some other parameters, e.g. --online=false etc.

Have a nice day, and again, thanks for the helpful post.


2. Rebecca Marie Jones says:

Do you have a link to where in the documentation you found the answer to this question? I am also interested in getting the start and end times for each word. I tried adapting the sample command you posted below, but it is not working for me. What is "ark:- - |" doing in the command?


12. yizhak says:

Hi,

I have another few questions about the code that is used to merge the .arpa files to one .arpa file:

print("Merging language models...")

# Read LM entries - works only on 3-gram models at most
grams = [[], [], []]
for lm in [baselm, extlm]:
    with io.open(lm, 'r+') as Flm:
        mode = 0
        for line in Flm:
            line = line.strip()
            if line == "\\1-grams:": mode = 1
            if line == "\\2-grams:": mode = 2
            if line == "\\3-grams:": mode = 3
            arr = line.split(" ")
            if mode > 0 and len(arr) > 1:
                if mode == 1 or mode == 2:
                    word = " ".join(arr[2:-1] if mode < 3 else arr[2:])
                    word = word.lower()
                grams[mode - 1].append(line.lower())

1. I don't understand where the variable "word" is used. Maybe "line.lower()" in the line after the if condition should be replaced with "word"?
2. why you re-check if mode < 3 inside the if condition?
3. why you take arr from index 2 and not from index 1? In standard .arpa files the phrase starts in index 1 after splitting the line.

Best, Yizhak.


13. Hi, it’s been a while since I wrote it, but right now it seems this part was not used in the end:

if mode == 1 or mode == 2:
    word = " ".join(arr[2:-1] if mode < 3 else arr[2:])
    word = word.lower()

It may work if you just remove these lines. I suspect it might have been there partly because first I wanted to parse and reconstruct the lines in the language models in order to remove redundancy, but then probably it didn't affect the outcome so just copying the whole line seemed easier.


14. Ruslan says:

Hi.
thank you very much for the great work!

In part “Merging the input files”, it is written, that “If you have used the CMU web service, you will need the .dic and the .lm files.”
Can you give me link to this service?


1. Ruslan says:

yes, I agree with you. I saw after I wrote the comment.
Thank you.


15. Ruslan says:

Hi.
One more question.
When I run the script :
python mergedic.py old.dic old.lm new.dic new.lm old+new.dic old+new.lm
I get the error:

File “mergedic.py”, line 19, in
[word, pronunciation] = arr
ValueError: need more than 1 value to unpack

what the problem?
In the .dic files:

mother M AH1 DH AXR
word W ER1 D

etc


1. There might be some lines without an associated reading. I suggest adding a check: if the length of arr is less than 2, skip processing that line.
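
The suggested guard could look something like this sketch (`parse_dict_line` is a hypothetical helper mirroring the merge script’s splitting logic):

```python
def parse_dict_line(line):
    """Split a lexicon line into (word, pronunciation); return None for
    malformed lines that have no pronunciation, instead of crashing."""
    arr = line.strip().replace("\t", " ").split(" ", 1)
    if len(arr) < 2:
        return None  # skip words without an associated reading
    return arr[0].lower(), arr[1].lower()

print(parse_dict_line("mother M AH1 DH AXR"))  # ('mother', 'm ah1 dh axr')
print(parse_dict_line("orphanword"))           # None
```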


16. Lia says:

Hi Krisztián,

First of all thank you so much for this tutorial – it really is the best there is.

You have mentioned that to re-distribute the probabilities of words we would need to acquire the original dataset. I have two questions ( I am new to this, so apologies in advance if the answer is obvious).

1. If I obtain the original dataset and combine it with my personal dataset/dictionary – can I retrain the language model using original Aspire acoustic model?

2. Can I train a brand new language model based on my own dataset using Aspire acoustic model?

Thank you,
Lia


1. Hi Lia and thanks!

1. I believe not, you would need the original corpus they have used. I recall they have used the Fisher-English corpus that is subject to license by the Linguistics Data Consortium and is quite expensive. The compiled model does not have this raw information anymore, therefore it is not subject to this license.

2. If your dataset is English, you can generate a dictionary and a language model from it. You can choose to just use your own dictionary/language model and skip the “merging” step. Your generated models alone should work well with the Aspire acoustic model.

Hope this helps


17. Hi! Thanks so much, Krisztián, for documenting your steps in such detail.

I want to report a subtle bug I noticed in merge_lms.py, which is that on line 43:

arr = line.split(" ")

…should be…

arr = line.split()

…so that it splits on tabs and spaces, not just spaces.

As it stands, the 1-grams in lm.arpa (the new 1-grams being added) are skipped entirely by line 44, since the 1-gram lines produced by ngrams-count do not have any spaces — the fields are tab-delimited. (It might be that an earlier version of ngrams-count produced space-delimited fields, since the LM that ships with the Aspire model does have space-delimited fields; not sure what the history is there.)

Thanks again!


1. Hi Doug,

Yes you may be right, I don’t remember running into problems with this script as it is, but it’s likely that it’s incompatible with a newer formats. Just in case, I have updated the code in the post not to restrict the split to spaces.

Thanks for your feedback!


18. Kidist says:

Hey, thank you for the post. I am trying the aspire model and as following your steps, I got an error like ‘online2-wav-nnet3-latgen-faster command not found’. Is it something related to the path.sh ?


> 'scp:echo utterance-id1 |' --> this contains the path to your wave file

Something missing here? For clarity this should be:

'scp:echo utterance-id1 path-to.wav |' --> replace path-to.wav with your wav file.


Thanks for an excellent tutorial, by the way.


21. Ernst says:

Hi Krisztián,
thank you so much for an excellent post!!!
I would like to enhance the model just with persons’ last names – so no corpus. Do you happen to know how to do that?
Thanks a lot,
Ernst


1. Hi Ernst, and thank you. I think you can easily do this by filling out the “corpus.txt” with only one last name per line.


22. Hi Krisztián,

Thanks for the excellent post. I am running into a problem first time I try to run online2-wav-nnet3-latgen-faster. I get the error:

online2-wav-nnet3-latgen-faster: symbol lookup error: /home/avalen02/kaldi/src/lib/libkaldi-ivector.so: undefined symbol: _ZN5kaldi13MessageLoggerD1Ev

Do you have any clues?

Thanks
Andy


23. Hi!
We have been trying to use this method to build our own corpus and language model. The system works perfectly till this command: utils/format_lm.sh $dict $lm_src.gz $dict_src/lexicon.txt $lang

Irrespective of whether we merge the language model we built with an existing language model or use our SRILM built model alone, the error is the same : “invalid n-gram data line”.

We feel that the lines that are mentioned as invalid are probably considering whitespace as a “gram” (but we aren’t really sure). Some of the lines showing errors have been highlighted in yellow.
The corpus, lm.arpa and lexicon.txt are uploaded here :

We have exactly followed your commands and yet are not able to figure out this error. Please let us know what you think of this issue.

the error is :
=========================================================
utils/format_lm.sh $dict $lm_src.gz $dict_src/lexicon.txt $lang
Converting ‘TechnicalCorpus/local/lang/lm.arpa.gz’ to FST
ERROR (arpa2fst[5.5.206~1-abfbc5]:Read():arpa-file-parser.cc:185) line 335 [-2.770244 able]: Invalid n-gram data line

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::FatalMessageLogger::~FatalMessageLogger()
main
__libc_start_main
_start

ERROR (arpa2fst[5.5.206~1-abfbc5]:Read():arpa-file-parser.cc:185) line 335 [-2.770244 able]: Invalid n-gram data line

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::FatalMessageLogger::~FatalMessageLogger()
main
__libc_start_main
_start

===========================================================


24. Neil Rao says:

Great article. I got to the part where we test it, but it is saying the “echo utterance-id1″ line is invalid for me. Any ideas?

WARNING (online2-wav-nnet3-latgen-faster[5.5.277~1-b180]:ReadScriptFile():kaldi-table.cc:72) Invalid 1'th line in script file: "utterance-id1"
WARNING (online2-wav-nnet3-latgen-faster[5.5.277~1-b180]:ReadScriptFile():kaldi-table.cc:46) [script file was: 'echo utterance-id1 |']
Error opening RandomAccessTableReader object (rspecifier is: scp:echo utterance-id1 |)


1. Neil Rao says:

Ah wait, figured it out. For those with the same problem: pass in the wav file in the parameter: 'scp:echo utterance-id1 test.wav|'
