Feb 2, 2017

ASR on a phone

My greatest frustration with Google Now is the latency.  I understand that the amount of computation and data required for modern ASR (automatic speech recognition) does not fit on a phone, hence the need to ship the compressed voice data off to the Google backend and eat the latency penalty.  But if we want a Her-like speech interface, ASR will HAVE TO run on a phone-scale resource.  Since I have some experience with SW optimization, I looked for an ASR SW package I could play around with, and found Sirius: an open source IPA (intelligent personal assistant).

U. Michigan Sirius

Sirius bundles together the following open source projects to deliver IPA features:
  • ASR
    • CMU Sphinx: widely used GMM (Gaussian mixture model) ASR SW.
    • RWTH's RASR
    • Kaldi: DNN (deep neural network) based ASR
  • OpenEphyra: question and answer system, based on IBM's Watson
  • SURF: image matching algorithm implemented using OpenCV
The GMM scores HMM state transitions by mapping an input feature vector into a multi-dimensional coordinate system and iteratively scoring the features against the trained acoustic model.  A DNN is defined by its number of hidden layers, and scoring amounts to one forward pass through the network.  In recent years, industry and academia have moved from GMM toward DNN due to its higher accuracy.  Text output from ASR is passed to the Q&A system, which uses 3 core processes to extract textual information:
  1. word stemming ("elected" --> "elect").  Porter stemming
  2. regular expression matching ([#th] --> #)
  3. part-of-speech tagging, using CRF (conditional random field)
The QA service takes more time than ASR, and its runtime is more variable, primarily because of the time to select the best-fitting answer.

The Sirius Suite (C/C++) benchmark suite captures the bottlenecks:
  • ASR: GMM/DNN scoring, rather than HMM
    • GMM: nested loops that iteratively score the feature vector against the trained data (acoustic model, language model, dictionary).  The entire data required for GMM fits in 2 GB.
  • QA: all 3 core processes (above)
    • Stemmer: checks for multiple variants of a word (suffix, etc.)

Building and running sirius-suite on Ubuntu 14

I downloaded sirius-suite-1.1 and sirius-caffe-1.0 (a dependency), which requires cmake (to build yet another dependency: OpenCV).  I advise you to run this before the official install process (the Sirius developers likely missed these packages because they are on Ubuntu 12):

$ sudo apt-get install cmake protobuf-compiler liblmdb-dev python-numpy-dev aptitude
$ sudo aptitude install libhdf5-serial-dev

To resolve the dependency conflict for libhdf5-serial-dev, manually override the choice aptitude first offers and accept an alternative.

Then follow the instructions at the sirius-caffe link above to build both caffe and the sirius-suite.  The suite needs to be told where the sirius-caffe libs are, so run the test command with LD_LIBRARY_PATH, like this:

$ LD_LIBRARY_PATH=/mnt/work/CL/sirius/caffe/distribute/lib make test
{
"kernel":"gaussian_mixture_model",
"abrv":"gmm",
"gmm": 13.183000
}
...

I don't know how to interpret the tests yet, so I move on to the individual kernel testing.  Even though the QA system is the tallest pole to whack for improved user experience, I am new to NLP (natural language processing), so I start where there is some physical signal processing, which is currently my area of expertise.

GMM ASR scoring

In the gmm/ folder of the sirius suite, the build created only 2 folders: baseline and pthread, because I have not attached an ML605 FPGA (which I've had for a few years already) or a CUDA GPU.  My laptop actually has a CUDA GPU (Quadro K2100M), so I'll see if I can run the CUDA version too--but later.  This Sirius Suite "kernel" comes with its one test data set: gmm_data.txt, which is more than 100 MB untarred.  It looks like a time series of feature vectors (N=29); for each of the 5120 samples, means and precs are matrices while weight and factor are vectors.
There is another vector, confusingly named feature_vector, which is just a bias for each feature.
Without any idea of the underlying physical data (what was uttered), I'll try to understand the GMM code.  The score computation goes like this:


\ln(Val2) = weight_f
+ \frac{1}{\ln(1.0001)} \sum_f {prec_f \, (C_f - means_f)^2}
- factor_f
\\
\ln\Delta = |score - \ln(Val2)|
\\
\ln(Highest) = \begin{cases}
score & score - \ln(Val2) < 0\\
\ln(Val2) & \text{otherwise}
\end{cases}


DNN (deep neural network) scoring

Machine learning is all the rage these days, perhaps because the available compute power is finally allowing the learning algorithms to yield useful results.  Supposedly, DNN is one such example.

ASR on Raspberry Pi 3

My Intel quad-core laptop with 16 GB RAM does not have a mobile-device spec.  It would be great to run Sirius on an Android phone, but my phone is not open to running a stock Linux, and I cannot easily add GPIO HW to it.  Except for the somewhat stingy RAM size, the recently released Raspberry Pi 3 has a reasonable mobile-device spec:

  • CPU: Broadcom BCM2837 (quad-core ARM Cortex-A53, 64-bit ARMv8 architecture)
  • Memory: 1 GB LP DDR2
  • GPU: VideoCore IV
I prefer to roll my own Linux distribution with Buildroot.  The latest Buildroot even supports RPi 3 out of the box, so I will only need to modify the packages.

~/work/CL$ git clone git://git.buildroot.net/buildroot rproot
~/work/CL/rproot$ make raspberrypi3_defconfig
~/work/CL/rproot$ make xconfig

The Buildroot default configuration wants to create an SD card image, whereas I prefer to NFS mount the rootfs during development.
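For reference, NFS-mounting the rootfs means pointing the kernel command line (cmdline.txt on the Pi's boot partition) at the export.  A sketch, where the server address 192.168.1.10 and export path /srv/rpi3-rootfs are placeholders of my own, not anything Buildroot generates:

```
console=ttyAMA0,115200 root=/dev/nfs nfsroot=192.168.1.10:/srv/rpi3-rootfs,vers=3,tcp rw ip=dhcp rootwait
```

This also requires the kernel to have NFS-root support (CONFIG_ROOT_NFS) and the Pi's Ethernet driver built in, not as modules.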

