Kaldi ASR: Extending the ASpIRE model

Introduction

The initial goal was to achieve a good word error rate (WER) using open source or free software. After trying some of the existing options, one stood out with impressively low WER values: Kaldi. After some experimentation, I found that Kaldi with the ASpIRE model works quite well out of the box for generic English speech recognition; however, it missed almost all of the technical words in the recordings I gave it. This was no surprise, since ASpIRE is a model for generic conversational English – it was never trained on domain-specific words. So, I wanted to extend its vocabulary and grammar.

Let’s talk about speech recognition in general. In a simplified view, a speech recognition engine processes incoming speech and converts it to phones. The phones are then mapped to words, and finally the words are strung together to form sentences. The models used in this process usually have three logical parts:

  • Acoustic model: used during acoustic feature extraction. The input is the raw audio and the outputs are acoustic features, phonemes or tri-phones: something that can be connected to the atomic parts of speech and is useful for linking them together into logical groups, in practice words.
  • Dictionary: sometimes also called the vocabulary or lexicon. The list of words in the language that the system can decode.
  • Language model: or grammar, defines how words can follow each other. In some cases it can be defined by a set of rules, e.g. in a helpdesk menu navigation system or when you only need to understand simple commands. In more generic cases however, such as general speech transcription, a statistical approach is more common. In the latter case, the language model is a large list of word tuples (n-grams) with assigned probabilities.

To extend the model with new words, you extend the dictionary and the language model. Luckily, we don’t have to modify the acoustic model, since the new words are most probably made up of the same phones that are used in everyday conversation. The system needs to know which new words or word tuples exist and with what probability, so that it can predict them; it will never predict something that does not exist in its corpus.
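
To make these formats concrete: a pronunciation dictionary (lexicon) entry is simply a word followed by its phones, while a statistical language model in the common ARPA text format lists n-grams together with their log10 probabilities. Below is a made-up lexicon entry followed by a made-up ARPA 3-gram entry (neither is taken from the actual ASpIRE files):

kaldi k ae l d iy
-2.5107 extend the model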

The following technical tutorial will guide you through booting up the base Kaldi with the ASpIRE model, and extending its language model and dictionary with new words or sentences of your choosing.

Note: this tutorial assumes you are using Ubuntu 16.04 LTS.

Download and install Kaldi and the ASpIRE model

You can skip this if you have a working Kaldi environment already.

Clone the Kaldi git repository; I used commit “e227eda38339a2f7f0cbbc73b166823c13fdde25”.

git clone https://github.com/kaldi-asr/kaldi

Compile the Kaldi components according to the INSTALL file. Normally this should be easy.

Download & untar the ASpIRE chain model using the link provided at: http://kaldi-asr.org/models.html

The “egs” folder contains example models and scripts for Kaldi and the model data for them is available separately because of the large file sizes. We will uncompress the chain model data into the appropriate folder for the ASpIRE recipe.

cd kaldi/egs/aspire/s5
wget http://dl.kaldi-asr.org/models/0001_aspire_chain_model.tar.gz
tar xfv 0001_aspire_chain_model.tar.gz

To prepare the model for use, check the README.TXT; I’ll list the preparation commands below:

steps/online/nnet3/prepare_online_decoding.sh --mfcc-config conf/mfcc_hires.conf data/lang_chain exp/nnet3/extractor exp/chain/tdnn_7b exp/tdnn_7b_chain_online

Note: this one will take some time to finish. Be patient.

utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test exp/tdnn_7b_chain_online exp/tdnn_7b_chain_online/graph_pp

At this point we can test whether the (yet untouched) model works. Pick your favorite audio recording and convert it to a 16-bit 8 kHz mono waveform. This is a must for the decoding to work properly, since the original model was trained on audio files in this format. If you’re not sure how to convert to the appropriate format, use ffmpeg with the following snippet:

ffmpeg -i <your_original_file.wav> -acodec pcm_s16le -ac 1 -ar 8000 <your_8khz_file.wav>

These two scripts set up the path and environment variables for Kaldi decoding; run them once before working with the ASpIRE model:

. cmd.sh
. path.sh

Finally, run the following command to transcribe your 8 kHz waveform:

online2-wav-nnet3-latgen-faster \
  --online=false \
  --do-endpointing=false \
  --frame-subsampling-factor=3 \
  --config=exp/tdnn_7b_chain_online/conf/online.conf \
  --max-active=7000 \
  --beam=15.0 \
  --lattice-beam=6.0 \
  --acoustic-scale=1.0 \
  --word-symbol-table=exp/tdnn_7b_chain_online/graph_pp/words.txt \
  exp/tdnn_7b_chain_online/final.mdl \
  exp/tdnn_7b_chain_online/graph_pp/HCLG.fst \
  'ark:echo utterance-id1 utterance-id1|' \
  'scp:echo utterance-id1 <your_8khz_file.wav>|' \
  'ark:/dev/null'

Don’t skip this step: make sure you’ve got this running before proceeding to the next one.
Ideally you should see something like this among the log entries:

utterance-id1 <the transcription of your audio file>

If you have this, you’re good. Let’s look at what some of the parameters mean.

online2-wav-nnet3-latgen-faster -> this is the decoding utility used with this model; it resides in the kaldi/src/online2bin folder
  --online=false -> set online decoding to false. Kaldi supports online decoding, which means the transcription starts before the audio file has been read completely. We turn this off for maximum accuracy.
  --config=exp/tdnn_7b_chain_online/conf/online.conf -> the path of the configuration to use during decoding
  --beam=15.0 -> the value used for beam pruning. In a nutshell, increasing this value results in better accuracy but slower decoding, while decreasing it makes decoding faster but less accurate.
  --word-symbol-table=exp/tdnn_7b_chain_online/graph_pp/words.txt -> as Kaldi uses symbols (numbers) to identify words internally, this file allows the decoder to map the symbols back into human-readable words at the end of the processing.
  exp/tdnn_7b_chain_online/final.mdl -> the path to the acoustic model
  exp/tdnn_7b_chain_online/graph_pp/HCLG.fst -> the path to the HCLG WFST graph
  'ark:echo utterance-id1 utterance-id1|'
  'scp:echo utterance-id1 <your_8khz_file.wav>|' -> this contains the path to your wave file
  'ark:/dev/null'
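
The last argument is where the decoder writes the output lattices; in the command above they are simply thrown away. If you would rather keep them and turn the result into plain text, one way (a sketch; the archive and file names below are my own choices) is to write the lattices to a text archive and extract the best path:

# in the decoding command above, replace 'ark:/dev/null' with 'ark,t:lat.ark' to keep the lattices,
# then extract the single best path (as word IDs) and map the IDs back to words
lattice-best-path --acoustic-scale=1.0 ark,t:lat.ark ark,t:best_path.tra
utils/int2sym.pl -f 2- exp/tdnn_7b_chain_online/graph_pp/words.txt < best_path.tra

The result is printed as the utterance id followed by the recognized words, just like the log line shown earlier.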

Generating and merging your own corpus

In Kaldi, the HCLG.fst contains (aside from other things) the dictionary and the language model. Without going into much detail, Kaldi represents these in the form of a large weighted finite state transducer (if you want to know more, the Kaldi homepage is a good place to start). When “compiling” the dictionary and grammar into the HCLG.fst file, many optimizations are conducted, so changing the .fst file directly is out of the question. What we can do however, is to change the source files and recompile them into our own HCLG.fst.

Let’s see where these are located:

  • The dictionary resides in the data/local/dict directory
  • The language model is in the data/local/lm/3gram-mincount directory (in compressed format)

We will add new entries to these files, but first we must get our new material into an appropriate format.

The tutorial will assume that you have a “corpus.txt” file containing sentences (and therefore words) that you want to include in the corpus for prediction. The format should be: one sentence per line. It can be one word per line, or sentences from a book, technical document, etc.
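
For example, a domain-specific corpus.txt might look like this (made-up lines; use your own material here):

please restart the kubernetes cluster
the hypervisor migrated the virtual machine to another host
increase the replica count of the ingress controller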

It is best to create a separate directory in “kaldi/egs/aspire/s5” for your own files; I will simply assume the use of a directory called “new” in this tutorial.

Option A – Using CMU’s knowledge base tool

If you don’t mind sharing your corpus with a third party, CMU offers the Sphinx Knowledge Base Tool, where you can upload your corpus file and get back a computed language model and dictionary, ready to use. The tool however has some limitations on the size of the input file.

The link for the tool: http://www.speech.cs.cmu.edu/tools/lmtool-new.html

You are going to need the generated .dic and .lm files to proceed.
You may now skip to the part: ‘Merging the input files’.

Option B – the more fun, DIY way

Creating the dictionary

The dictionary contains words and their phonetic forms, so as a first step we need to create phonetic readings for the new words.

First, we need to extract every unique word from the corpus; for that we can use this one-liner:

grep -oE "[A-Za-z\\-\\']{3,}" corpus.txt | tr '[:lower:]' '[:upper:]' | sort | uniq > words.txt

The regular expression matches every word of at least three characters made up of English letters, apostrophes or dashes. The second command converts the words to uppercase, the third sorts them, and finally only the unique entries are left at the end of the pipeline. We write the results to words.txt.
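
For instance, if corpus.txt contained only the made-up line “please restart the kubernetes cluster” from the earlier example, words.txt would end up as:

CLUSTER
KUBERNETES
PLEASE
RESTART
THE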

Sequitur G2P is an application that can be trained to generate pronunciations for new words. We will use CMU’s dictionary as a basis to train a pronunciation model; it is included in the ASpIRE model’s codebase. Run the following commands to train a model, and note that it will take quite a while to finish (probably several hours). Alternatively, you may download the model I’ve generated here and use that.

Build and install Sequitur G2P according to the instructions: https://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html .

After installing, run the following commands to train a model (in case you don’t want to use the one provided).

As these will take quite a while, it’s best to leave them running over the weekend.

Run from ‘kaldi/egs/aspire/s5/data/local/dict/cmudict/sphinxdict’:

g2p.py --train cmudict --devel 5% --write-model model-1

g2p.py --train cmudict_SPHINX_40 --devel 5% --write-model model-1
g2p.py --model model-1 --test cmudict_SPHINX_40 > model-1-test

g2p.py --model model-1 --ramp-up --train cmudict_SPHINX_40 --devel 5% --write-model model-2
g2p.py --model model-2 --test cmudict_SPHINX_40 > model-2-test

g2p.py --model model-2 --ramp-up --train cmudict_SPHINX_40 --devel 5% --write-model model-3
g2p.py --model model-3 --test cmudict_SPHINX_40 > model-3-test

g2p.py --model model-3 --ramp-up --train cmudict_SPHINX_40 --devel 5% --write-model model-4
g2p.py --model model-4 --test cmudict_SPHINX_40 > model-4-test

g2p.py --model model-4 --ramp-up --train cmudict_SPHINX_40 --devel 5% --write-model model-5
g2p.py --model model-5 --test cmudict_SPHINX_40 > model-5-test

g2p.py --model model-5 --ramp-up --train cmudict_SPHINX_40 --devel 5% --write-model model-6
g2p.py --model model-6 --test cmudict_SPHINX_40 > model-6-test

g2p.py --model model-6 --ramp-up --train cmudict_SPHINX_40 --devel 5% --write-model model-7
g2p.py --model model-7 --test cmudict_SPHINX_40 > model-7-test

You can check the error rate at the end of each ‘*-test’ file. Model #7 should already give you relatively low error values; you may continue this training procedure until the error rate is satisfactory.
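
The summary statistics are written at the end of each test file, so a quick way to inspect them is simply the following (the exact wording of the summary depends on your Sequitur version):

tail model-7-test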

Finally, apply your generated model to your new words.

g2p.py --model model-7 --apply words.txt > words.dic
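
You can sanity-check the result with “head words.dic”: each line should contain a word followed by its generated pronunciation in the same phone set as the training dictionary, roughly like this made-up entry:

KUBERNETES K UW B ER N EH T IY Z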

Creating the language model

After creating your dictionary, you can also build a language model using a tool called SRILM. It can be obtained from this website: http://www.speech.sri.com/projects/srilm/download.html . I used version 1.7.2 for this tutorial. After setting it up, run the following commands.
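
Depending on how you installed it, you may need to put the SRILM binaries on your PATH first. As a sketch (the install path and the machine type below are assumptions about your environment; adjust them to wherever you built SRILM):

export SRILM=/opt/srilm
export PATH=$PATH:$SRILM/bin:$SRILM/bin/i686-m64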

Convert the corpus to uppercase, because the dictionary is in uppercase format

cat corpus.txt | tr '[:lower:]' '[:upper:]' > corpus_upper.txt

Generate a language model from the corpus

ngram-count -text corpus_upper.txt -order 3 -limit-vocab -vocab words.txt -unk -map-unk "<unk>" -kndiscount -interpolate -lm lm.arpa
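
If everything went well, lm.arpa starts with an ARPA header that lists how many n-grams of each order were estimated; a quick “head lm.arpa” should show something like this (the counts below are made up):

\data\
ngram 1=1523
ngram 2=8714
ngram 3=11209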

Merging the input files

Now that we have our own dictionary and language model files, it is time to merge them with the ones used in the original model. I’ve done this “manually” with a small Python script I wrote:

import io
import sys
import collections

if len(sys.argv) != 7:
    print(sys.argv[0] + " <base_dict> <base_lm> <ext_dict> <ext_lm> <merged_dict> <merged_lm>")
    sys.exit(2)

[basedic, baselm, extdic, extlm, outdic, outlm] = sys.argv[1:7]

print("Merging dictionaries...")

words = collections.OrderedDict()
for dic in [basedic, extdic]:
    with io.open(dic, 'r+') as Fdic:
        for line in Fdic:
            arr = line.strip().replace("\t", " ").split(" ", 1) # Sometimes tabs are used
            if len(arr) < 2:
                continue # Skip empty or malformed lines
            [word, pronunciation] = arr
            word = word.lower()
            if word not in words:
                words[word] = set([pronunciation.lower()])
            else:
                words[word].add(pronunciation.lower())

with io.open(outdic, 'w', newline='\n') as Foutdic:
    for word in words:
        for pronunciation in words[word]:
            Foutdic.write(word + " " + pronunciation + "\n")

print("Merging language models...")

# Read LM entries - works only on 3-gram models at most
grams = [[], [], []]
for lm in [baselm, extlm]:
    with io.open(lm, 'r+') as Flm:
        mode = 0
        for line in Flm:
            line = line.strip()
            if line == "\\1-grams:": mode = 1
            if line == "\\2-grams:": mode = 2
            if line == "\\3-grams:": mode = 3
            # Split on any whitespace - ARPA files may separate fields with tabs or spaces
            arr = line.split()
            if mode > 0 and len(arr) > 1:
                # This is an actual n-gram entry (probability followed by words); keep it as-is
                grams[mode - 1].append(line.lower())

with io.open(outlm, 'w', newline='\n') as Foutlm:
    Foutlm.write(
       u"\\data\\\n" +
       u"ngram 1=" + str(len(grams[0])) + "\n"
       u"ngram 2=" + str(len(grams[1])) + "\n"
       u"ngram 3=" + str(len(grams[2])) + "\n"
    )
    for i in range(3):
        Foutlm.write(u"\n\\" + str(i+1) + u"-grams:\n")
        for g in grams[i]:
            Foutlm.write(g + "\n")

    Foutlm.write(u"\n\\end\\\n")

You might have to unzip the language model first:

gunzip -k data/local/lm/3gram-mincount/lm_unpruned.gz

The next command assumes you are in the ‘new’ directory. If you used the CMU web service, use the generated .dic and .lm files instead of words.dic and lm.arpa.

python mergedicts.py ../data/local/dict/lexicon4_extra.txt ../data/local/lm/3gram-mincount/lm_unpruned words.dic lm.arpa merged-lexicon.txt merged-lm.arpa

Note: all the script does is a simple “merge” of the model entries. It does not take into account how the newly added domain-specific words change the statistical probabilities of the original words. To do a proper merge, we would need the original training text, which is not available for free. Still, even with this simple method I got quite good results recognizing new words and sentences.

Recompile the HCLG.fst

Now we have all the input files we need to re-compile the HCLG.fst using our new lexicon and grammar files.

Make sure you are in the ‘kaldi/egs/aspire/s5’ folder.

Let’s assume that:

  • you have created a ‘new/local/dict’ folder and copied over the following files from ‘data/local/dict’:
    • extra_questions.txt
    • nonsilence_phones.txt
    • optional_silence.txt
    • silence_phones.txt
  • you have copied your merged dictionary file into this folder as ‘lexicon.txt’
  • you have created a new/local/lang folder with the merged language model in it called ‘lm.arpa’
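
If you have not prepared these folders yet, the following commands will do it, assuming the merged files are still in the ‘new’ directory under the names used earlier:

mkdir -p new/local/dict new/local/lang
cp data/local/dict/extra_questions.txt data/local/dict/nonsilence_phones.txt \
   data/local/dict/optional_silence.txt data/local/dict/silence_phones.txt new/local/dict/
cp new/merged-lexicon.txt new/local/dict/lexicon.txt
cp new/merged-lm.arpa new/local/lang/lm.arpa
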
# Set up the environment variables (again)
. cmd.sh
. path.sh

# Set the paths of our input files into variables
model=exp/tdnn_7b_chain_online
phones_src=exp/tdnn_7b_chain_online/phones.txt
dict_src=new/local/dict
lm_src=new/local/lang/lm.arpa

lang=new/lang
dict=new/dict
dict_tmp=new/dict_tmp
graph=new/graph

# Compile the word lexicon (L.fst)
utils/prepare_lang.sh --phone-symbol-table $phones_src $dict_src "<unk>" $dict_tmp $dict

# Compile the grammar/language model (G.fst)
gzip < $lm_src > $lm_src.gz
utils/format_lm.sh $dict $lm_src.gz $dict_src/lexicon.txt $lang

# Finally assemble the HCLG graph
utils/mkgraph.sh --self-loop-scale 1.0 $lang $model $graph

# To use our newly created model, we must also build a decoding configuration; the following line will create it in the new/conf directory
steps/online/nnet3/prepare_online_decoding.sh --mfcc-config conf/mfcc_hires.conf $dict exp/nnet3/extractor exp/chain/tdnn_7b new

Let’s test our newly made model now (note that the configuration, word-symbol table and HCLG graph paths have been updated):

online2-wav-nnet3-latgen-faster \
  --online=false \
  --do-endpointing=false \
  --frame-subsampling-factor=3 \
  --config=new/conf/online.conf \
  --max-active=7000 \
  --beam=15.0 \
  --lattice-beam=6.0 \
  --acoustic-scale=1.0 \
  --word-symbol-table=new/graph/words.txt \
  exp/tdnn_7b_chain_online/final.mdl \
  new/graph/HCLG.fst \
  'ark:echo utterance-id1 utterance-id1|' \
  'scp:echo utterance-id1 <your_8khz_file.wav>|' \
  'ark:/dev/null'

So there you have it! You may now use the model to decode your audio files, write a web server around it and serve it as a transcription REST interface, or even build an HTML5 & JavaScript frontend to transcribe what you say into your microphone instantly. It’s up to you how you use this amazing tool, which is available thanks to the people developing Kaldi and the enormous research they have put behind it.

5 thoughts on “Kaldi ASR: Extending the ASpIRE model”

  1. Great work! Do you know if it is possible to learn another language using this method (for example by only keeping the new vocabulary)? If it is not, is there a way to modify the pre-trained acoustic model in order to do it?

    1. Thanks. Unfortunately I think it is not possible, since another language would require a different acoustic model. Modifying an existing acoustic model is something I did not get into, so I cannot say much about it, but I believe it is a much more complex task than modifying the language model or dictionary entries. However, if you can obtain a complete chain (including an acoustic model) for that other language, you may be able to modify the words in it in a similar way to this blog entry.

    1. First you need to change the last line “ark:/dev/null” to something like “ark,t:output.lat”.
      This way you can get the lattices in a text format. You can then select the best path through the lattice:
      lattice-best-path --acoustic-scale=0.1 "ark,t:output.lat" "ark,t:output.txt"
      output.txt contains the best matching predicted text, but as dictionary ID numbers.
      You can finally get a human-readable form by converting them to word symbols using the dictionary:
      %kaldi-path%/egs/wsj/s5/utils/int2sym.pl -f 2- %path-to-your-dictionary%/words.txt < output.txt

      Alternatively you can modify "online2-wav-nnet3-latgen-faster.cc" & recompile 🙂 since it kind of does the same post-processing, but AFAIR you cannot make it write the final prediction to a file directly, just the lattices.
