Phonetic audio mining, audio searching, speech analytics

What is audio mining?

Audio mining is a technique for searching audio for occurrences of spoken words or phrases. Speech technology is used to recognise the words or phonemes spoken in an audio or video file, and audio mining searches can then be carried out to locate specific words and phrases within the audio. These searches typically run many thousands of times faster than real time, so large quantities of audio or speech can be searched in a short time.

Terminology

A number of different terms are used in connection with audio mining. These include audio mining, audio indexing, phonetic searching, phonetic indexing, speech indexing, audio analytics, speech analytics, word spotting and information retrieval. Note that the terms "audio analytics" and "speech analytics" are often used to cover both audio mining and other speech analysis technologies, for example speaker identification (see the separate speech analytics page).

Audio mining applications

Audio mining software can be used to search audio or video content that contains speech. Typical applications include searching large audio/media archives for which little or no information describing the audio content is available. This could be used, for example, to retrieve relevant clips for a news story from a large video archive. Because audio mining searches can typically be carried out many thousands of times faster than real time, it becomes possible to search amounts of speech data that could not previously be searched, owing to the time it would take for humans to listen to the material.

Audio mining techniques are also used in telephony applications, for example to help automate quality control where it is important to check that telephone agents actually said what they were supposed to say. Audio mining searches on the recorded calls can be made to locate words or phrases that must always be said. This can greatly increase the number of calls that can be checked, because relevant matches are found much faster using audio mining than by traditional means (a human listening to the recorded calls).
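The quality control check described above can be sketched as follows. This is a minimal illustration, not any vendor's API: it assumes the recorded calls have already been searched, and that the results are available as a hypothetical dictionary mapping each call ID to the hit times found for each required phrase. The phrases and call IDs are invented for the example.

```python
# Hypothetical phrases that agents must always say (illustrative only).
REQUIRED_PHRASES = ["this call may be recorded", "is there anything else"]


def calls_needing_review(search_hits, required_phrases):
    """Return {call_id: [missing phrases]} for calls lacking a required phrase.

    search_hits is an assumed structure: {call_id: {phrase: [hit_times]}}.
    A phrase with no entry, or an empty hit list, counts as missing.
    """
    flagged = {}
    for call_id, hits in search_hits.items():
        missing = [p for p in required_phrases if not hits.get(p)]
        if missing:
            flagged[call_id] = missing
    return flagged


# Example search results for two recorded calls (made-up data).
example_hits = {
    "call-001": {"this call may be recorded": [2.4],
                 "is there anything else": [181.0]},
    "call-002": {"this call may be recorded": [1.9]},
}
flagged_calls = calls_needing_review(example_hits, REQUIRED_PHRASES)
print(flagged_calls)
```

Only the flagged calls would then need to be passed to a human listener.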

Audio mining has also been used for captioning (subtitling) of TV and other video/media content, as the speech content associated with the text for each caption can be located by running a suitable audio mining search. However, a more effective and efficient way to obtain the start and end times of each word in the caption text is to use speech recognition to automatically align the known text with the speech.

Approaches

There are two common approaches to audio mining - one uses large vocabulary continuous speech recognition (LVCSR), and the other uses phonetic recognition to carry out phonetic audio mining. An overview of these two approaches to audio mining is given below.

LVCSR audio mining

This is a two-stage process. In the first stage (pre-processing or indexing stage), the speech content of the audio is processed by a large vocabulary recogniser to generate a searchable index file. The index file contains information about the sequences of words spoken in the audio or video data.

In the second stage (search stage), a search term is defined (e.g. a word or phrase), and one or more index files are searched for all occurrences that match the specified search term. The results of the search can be displayed graphically as "search hits" in the audio file, or the relevant portions of the audio or video file can be played to the user.
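The two stages can be sketched in a few lines. This is a simplified illustration under stated assumptions: the recogniser output is represented as a hypothetical list of (word, start time) pairs, and the index is a plain in-memory mapping rather than a real index file. Production systems typically store richer information, such as alternative word hypotheses and confidence scores.

```python
from collections import defaultdict


def build_word_index(recognised_words):
    """Indexing stage: map each word to the positions where it was recognised.

    recognised_words is an assumed format: a list of (word, start_seconds).
    """
    index = defaultdict(list)
    for position, (word, _start) in enumerate(recognised_words):
        index[word.lower()].append(position)
    return index


def search_phrase(recognised_words, index, phrase):
    """Search stage: return the start time of every occurrence of the phrase."""
    terms = phrase.lower().split()
    if not terms:
        return []
    hits = []
    # Use the index to jump to candidates for the first word, then verify
    # that the following words match the rest of the phrase.
    for position in index.get(terms[0], []):
        window = recognised_words[position:position + len(terms)]
        if [w.lower() for w, _ in window] == terms:
            hits.append(recognised_words[position][1])
    return hits


# Made-up recogniser output for a short recording.
transcript = [("thank", 0.0), ("you", 0.3), ("for", 0.5), ("calling", 0.7),
              ("this", 1.4), ("call", 1.6), ("may", 1.9), ("be", 2.1),
              ("recorded", 2.2), ("thank", 9.0), ("you", 9.3)]
word_index = build_word_index(transcript)
print(search_phrase(transcript, word_index, "thank you"))  # [0.0, 9.0]
```

The returned times are the "search hits" that an application could display or use to start playback.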

Phonetic audio mining

Like LVCSR audio mining, phonetic audio mining is a two-stage process. In the first stage, audio is processed (indexed) with a phonetic recogniser to generate an index file. The index file produced by this phonetic approach to audio mining stores the phonetic content of the speech, in contrast to the index files generated by LVCSR methods, which contain information about words.

The second stage is similar to that of the LVCSR approach: a search term (word or phrase) is defined, and a number of phonetic index files are searched to retrieve matches for it. Here, however, the search term is first converted into a phonetic sequence, and it is this sequence that is matched against the phonetic index files. In the LVCSR approach, by contrast, all matching is done on the text that corresponds to the word or phrase.

It is also possible to enter a phonetic search term directly, if the user has sufficient phonetic expertise to enter the sequence of phones that corresponds to the pronunciation of the word or phrase they want to search for.
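A rough sketch of the phonetic search stage is shown below. Everything specific in it is an assumption made for illustration: the small pronunciation lexicon, the ARPAbet-style phone symbols, the (phone, start time) index format, and the matching rule, which simply counts substituted phones in a sliding window. Real phonetic search engines generally also allow for inserted and deleted phones and score matches acoustically.

```python
# Hypothetical pronunciation lexicon; the phone symbols are ARPAbet-style
# and chosen only for illustration.
LEXICON = {
    "terazap": ["T", "EH", "R", "AH", "Z", "AE", "P"],
    "new": ["N", "UW"],
}


def to_phones(term, lexicon=LEXICON):
    """Convert a word or phrase into a flat phone sequence."""
    phones = []
    for word in term.lower().split():
        phones.extend(lexicon[word])
    return phones


def phonetic_search(phone_index, query_phones, max_mismatches=1):
    """Return (start_seconds, mismatches) for each window close to the query.

    phone_index is an assumed format: a list of (phone, start_seconds).
    Only substitutions are counted; this is a simplification.
    """
    n = len(query_phones)
    if n == 0:
        return []
    hits = []
    for i in range(len(phone_index) - n + 1):
        window = [phone for phone, _ in phone_index[i:i + n]]
        mismatches = sum(1 for a, b in zip(window, query_phones) if a != b)
        if mismatches <= max_mismatches:
            hits.append((phone_index[i][1], mismatches))
    return hits


# Made-up phone recogniser output for "the new terazap is", with one
# recognition error: "Z" was recognised as "S".
recognised = ["DH", "AH", "N", "UW", "T", "EH", "R", "AH", "S", "AE", "P",
              "IH", "Z"]
phone_index = [(phone, i / 10) for i, phone in enumerate(recognised)]

print(phonetic_search(phone_index, to_phones("terazap")))  # [(0.4, 1)]
```

A user with phonetic expertise could bypass the lexicon and pass a phone list directly, for example `phonetic_search(phone_index, ["T", "EH", "R", "AH", "Z", "AE", "P"])`.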

Phonetic vs large vocabulary approaches to audio mining

The differences between the phonetic and large vocabulary approaches to audio mining lead to contrasting advantages and disadvantages in the two approaches. These differences relate to the speed of the two audio mining stages (indexing and searching), how the technologies deal with out of vocabulary words, and how they should be used to maximise the effectiveness of the audio mining searches.

Indexing and search speed

One key difference between the phonetic audio mining approach and the large vocabulary approach relates to which stage (indexing or searching) is the more computationally intensive. With phonetic audio mining, the rate at which the audio content can be indexed is many times faster than with LVCSR techniques. During the search stage, however, the computational burden is larger for phonetic search systems than for LVCSR approaches, where the search pass is typically a simple and lightweight operation.

Phonetic recognition does not require the use of complex language models: the phone recognition can be run effectively without knowledge of which phones were previously recognised. In contrast, knowledge of which words were previously recognised is vital for achieving good recognition accuracy in a large vocabulary system. LVCSR approaches must therefore use sophisticated language models, which leads to a much greater computational load at the indexing stage and results in significantly slower indexing speeds. Phonetic audio mining software can index audio data at around 100 times faster than real time, compared to speeds only a few times faster than real time for LVCSR systems. The reliance on complex language models also means that the data used to train LVCSR systems must be well matched to the data they will be used on.
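To put these indexing speeds in perspective, the short calculation below compares the two approaches on a single machine. The archive size of 10,000 hours is an assumption for illustration, and "a few times faster than real time" is read here as a factor of 3; neither figure comes from a measured system.

```python
def indexing_hours(audio_hours, speed_factor):
    """Hours of processing needed to index audio at a multiple of real time."""
    return audio_hours / speed_factor


archive_hours = 10_000   # assumed archive size
phonetic_factor = 100    # "around 100 times faster than real time"
lvcsr_factor = 3         # assumed reading of "a few times faster"

phonetic_time = indexing_hours(archive_hours, phonetic_factor)
lvcsr_time = indexing_hours(archive_hours, lvcsr_factor)

print(f"Phonetic indexing: {phonetic_time:.0f} hours")
print(f"LVCSR indexing:    {lvcsr_time:.0f} hours")
```

Under these assumptions, phonetic indexing takes about four days of processing, while LVCSR indexing takes several months.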

Vocabulary

An advantage of the phonetic search approach is that an open vocabulary is maintained, which means that searches for personal or company names can be performed without the need to reprocess the audio. With LVCSR systems, any word that was not known by the system at the time the speech was indexed can never be found. For example, imagine that a new product called "terazap" becomes popular. This word will not be in the dictionary of words used by an LVCSR audio mining system, which means the recogniser can never output this word, even if the word was actually spoken in audio that was processed by the system. In order to find matches for this new word, the LVCSR system has to be updated with a new dictionary that contains the word "terazap", and all the audio has to be pre-processed again, which is a time-consuming task. This problem does not occur with phonetic audio mining systems because they work at the level of phones, not words. As long as a phonetic pronunciation for the word can be generated at search time, the system will be able to find matches for the word, and no re-processing of audio is required.
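The out-of-vocabulary problem can be illustrated with a toy simulation. The `lvcsr_index` function below is hypothetical: it stands in for a recogniser that can only output words from its dictionary, and it writes "<unk>" for anything else. A real recogniser would more likely output acoustically similar in-vocabulary words instead, but the effect on searching is the same.

```python
def lvcsr_index(spoken_words, vocabulary):
    """Simulate a recogniser that can only output words in its vocabulary."""
    return [word if word in vocabulary else "<unk>" for word in spoken_words]


spoken = ["the", "new", "terazap", "is", "popular"]
vocabulary = {"the", "new", "is", "popular"}

# Index the audio before "terazap" is known to the system.
index = lvcsr_index(spoken, vocabulary)
found_before = "terazap" in index

# Updating the dictionary alone does not help: the stored index is unchanged.
vocabulary.add("terazap")
found_without_reindex = "terazap" in index

# Only re-processing the audio with the new dictionary makes the word findable.
index = lvcsr_index(spoken, vocabulary)
found_after_reindex = "terazap" in index

print(found_before, found_without_reindex, found_after_reindex)
# False False True
```

A phonetic index avoids the re-indexing step because the stored phones are independent of any word list; only the pronunciation of the new search term is needed at search time.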

In recent years, hybrid approaches have been used where some phonetic information is retained by large vocabulary systems, allowing them to tackle the problems of out-of-vocabulary items in a more effective way.

Existing audio mining and speech analytics technology

A number of different companies produce audio mining and speech analytics software, applications and SDKs. Below is a list of links to some of the available audio mining and speech analytics tools.

Aurix: audio miner SDK

Aurix audio miner - phonetic audio mining software (SDK).

Note that Aurix were acquired by Avaya in 2012 - see below.

Avaya Speech Analytics

Avaya speech analytics v2.0 is now available.

BBN

BBN broadcast monitoring system - LVCSR audio mining system.

AVOKE caller experience analytics

CallMiner

CallMiner - Speech analytics software.

Nexidia - FastTalk

Nexidia - phonetic audio mining software ("phonetic search engine").

Nuance

Nuance speech products.

Univoc

Univoc KWS - phonetic audio mining product.

Witness Systems

Witness Systems - Speech analytics for call centres.

Speech analytics and audio mining in the news

Try some Google news searches to check for current news articles about audio mining and speech analytics.

Last updated: March 2013