As we mentioned before, Speech Analytics has become an essential tool for providing insights and benefits to all departments in a company, but in a call center its implementation brings added value, because this is where a customer first comes into contact with a business.
In a previous blog post, we elaborated on the idea that Automatic Speech Recognition solutions have become a “must-have” nowadays, owing to the evolution of technology, global events, and a growth mindset. If in the 2000s it was a “nice-to-have”, today it has clearly become a “must-have”.
Brief description of Automatic Speech Recognition technology
Automatic Speech Recognition (ASR) is a technology that enables a program to process human speech into a written format. It is often confused with voice recognition, which identifies who is speaking, whereas speech recognition focuses on translating speech from a verbal format into text. For this reason, it is also known as computer speech recognition or speech-to-text.
In other words, ASR is a technology through which a machine, after recognizing and understanding the speech signal, turns it into the corresponding text or command. ASR involves the extraction of acoustic features, an acoustic model, and a language model.
If you want to know more about the Speech Analytics process, read our blog post, in which we cover all the details >>
Automatic Speech Recognition Process - RepsMate Engine
Voice activity detection
For the STT (Speech-To-Text) task, we focus only on the speech content, so we isolate the parts containing speech by applying a VAD (Voice Activity Detection) network. In our architecture, the VAD network outputs voice segments between 0.2 and 30 seconds long.
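As an illustration, a minimal energy-based VAD can be sketched as follows; the frame size, threshold, and duration bounds here are illustrative choices, not the parameters of our production VAD network, which is a learned model:

```python
import numpy as np

def energy_vad(signal, sr, frame_ms=20, threshold=0.01,
               min_dur=0.2, max_dur=30.0):
    """Toy energy-based VAD: mark frames whose RMS energy exceeds a
    threshold as speech, then merge consecutive speech frames into
    segments clamped to the [min_dur, max_dur] second range."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    speech = rms > threshold

    # merge consecutive speech frames into (start, end) pairs in seconds
    segments, start = [], None
    for i, is_speech in enumerate(speech):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))

    # enforce the duration bounds, splitting overly long segments
    out = []
    for s, e in segments:
        while e - s > max_dur:
            out.append((s, s + max_dur))
            s += max_dur
        if e - s >= min_dur:
            out.append((s, e))
    return out
```

A neural VAD replaces the energy threshold with a learned per-frame speech probability, but the segmentation logic is similar.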
1. Speaker recognition
We apply speaker recognition if the VoIP hardware does not provide a separate channel for each speaker. After this process, the audio segments may be split into smaller subsegments.
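When the hardware does record each speaker on a separate channel, speaker attribution reduces to a channel split; the snippet below illustrates that trivial case (the diarization model used otherwise is beyond this sketch):

```python
import numpy as np

def split_channels(audio):
    """For stereo VoIP recordings with one speaker per channel, speaker
    attribution is a simple channel split; mono recordings instead
    require a speaker-recognition (diarization) model."""
    assert audio.ndim == 2 and audio.shape[1] == 2, "expected stereo (N, 2)"
    return audio[:, 0], audio[:, 1]  # e.g. agent channel, customer channel
```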
2. Speech-to-text decoding
Each resulting segment is passed to the STT model and processed in the following stages:
- One segment is further split into overlapping windows of a few dozen to a few hundred milliseconds;
- Each window is encoded into a vector space of latent feature representations, each with a receptive field spanning multiple windows of a few hundred milliseconds;
- For each resulting context tensor, a token is predicted under a conditional probability using an alignment-free algorithm named CTC (Connectionist Temporal Classification). A token, in this case, can be either a letter or a symbol denoting an acoustic descriptor (noise, unintelligible chatter, pause, filler sounds etc.);
- Given the conditional probabilities for the resulting tokens, the transcript candidates are selected using a heuristic search algorithm named beam search, which will output a number of candidate text sequences;
- Due to the conditional independence assumption of the CTC algorithm, the text alignments may be erroneous; that is why a conditionally dependent text model, such as an autoregressive language model, is used to further evaluate the most plausible transcript. The semantic contextual information captured by the language model is weighted against the acoustic CTC + beam search text alignments to produce a more plausible text transcript.
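To make the decoding stages concrete, here is a simplified greedy CTC decoder (best token per frame, collapse repeats, drop blanks); our engine uses beam search plus a language model rather than this greedy shortcut, and the blank index and vocabulary below are illustrative assumptions:

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank token in the vocabulary

def ctc_greedy_decode(log_probs, vocab):
    """Simplified CTC decoding over a (time, vocab) matrix of token
    log-probabilities: take the best token per time step, collapse
    consecutive repeats, then drop blanks. Beam search would instead
    keep several candidate sequences per step and rescore them with
    a language model."""
    best = log_probs.argmax(axis=1)              # best token per time step
    collapsed = [t for i, t in enumerate(best)
                 if i == 0 or t != best[i - 1]]  # merge repeated tokens
    return "".join(vocab[t] for t in collapsed if t != BLANK)
```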
3. Unsupervised and supervised learning
The described task is supervised, therefore an STT model needs annotated transcripts to train on. Due to the scarcity of labelled audio data, especially in low-resource languages, an unsupervised solution has been adopted to improve STT performance. This solution aims to improve the acoustic information modelled by the latent feature representations in an unsupervised manner, using unlabeled audio inputs. This is done by transformer networks such as wav2vec 2.0, which mask randomly chosen audio windows and learn to predict the missing information from the surrounding context. This pretraining task results in a model able to capture context representations rich in acoustic information, which can then be used on the downstream supervised task of textual decoding with CTC and beam search.
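The masking step of this pretraining objective can be sketched as follows; the masking probability and span length mirror typical wav2vec 2.0 defaults but are illustrative here, and a real model masks latent representations inside the network rather than zeroing them outside it:

```python
import numpy as np

def mask_time_steps(latents, mask_prob=0.065, mask_len=10, rng=None):
    """Sketch of wav2vec 2.0-style time masking over a (time, dim) matrix
    of latent vectors: sample starting frames with probability mask_prob
    and zero out mask_len consecutive frames; the network is then trained
    to recover the masked content from the surrounding context."""
    rng = rng or np.random.default_rng(0)
    T = latents.shape[0]
    mask = np.zeros(T, dtype=bool)
    starts = rng.random(T) < mask_prob       # sample mask start positions
    for t in np.flatnonzero(starts):
        mask[t:t + mask_len] = True          # spans may overlap
    masked = latents.copy()
    masked[mask] = 0.0
    return masked, mask
```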
4. Domain specific noise
A robust, general-purpose model may handle both clean and noisy audio reasonably well, but a domain-specific ASR model will outperform it. The quality of the training dataset is critical, yet VoIP-specific ASR datasets are rare, available mostly in English only, and not open-sourced due to the confidentiality of the information. Our experiments show a sharp decrease in performance when the training dataset does not contain audio relevant to the domain of interest.
Some key aspects which are particular to call-center conversations are:
- Voice intonation, voice accents, filler sounds (such as ‘ah’ or ‘uhm’), stops or pauses in mid-sentence. These are common in telephone conversations but rare in studio recordings;
- The noise induced by audio transmission and compression, such as VoIP-telephony-specific compression algorithms and the noise of the recording equipment, which also differs from publicly available recorded audio transcripts;
- Background noise, background chatter which are specific to call-center conversations, such as when the rep speaks from an open-space environment with colleagues engaged in other conversations or when the customer is outside in a crowded space. Most public ASR datasets are recorded inside a relatively isolated room; therefore, they would not capture this type of noise which is common in call-center conversations;
- Conversation-specific language, consisting of short phrases that are highly dependent on the conversation context. Most language models are trained on very large datasets of publicly available internet text in a given language, mostly captured by web crawlers: Wikipedia articles, newsletters, transcribed speeches (parliamentary or news-related), product reviews etc. These domains (news, articles, public speeches, interviews) have their own specific language and may not capture the dynamics and contextual information of a conversation. Labeled telephone conversations are therefore a key factor in improving the ASR language model's performance, which is why a labeling team focused on domain-specific data is essential to an ASR system.
There are many data augmentation techniques which can simulate or synthesize background noise, background chatter and crosstalk, but our experiments show that a clean dataset is much more valuable.
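For reference, the most common such augmentation, mixing recorded background noise into clean speech at a chosen signal-to-noise ratio, can be sketched as:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a noise recording into clean speech at a target SNR (dB),
    a common augmentation when matched noisy data is unavailable.
    The noise is looped/trimmed to the speech length, then scaled so
    the speech-to-noise power ratio equals the requested SNR."""
    noise = np.resize(noise, speech.shape)          # loop/trim to length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12       # avoid division by zero
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```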
5. Domain adaptation
We consider domain adaptation a key aspect of a high-performing ASR system, therefore we have developed modules which:
- automatically train and finetune the ASR model on the audio data provided by the client;
- automatically finetune the language model on any client-provided text transcripts, which expands the language model's dictionary and focuses its attention on these transcripts, considering them more plausible and thus detecting them more easily;
- provide an interface where the transcript can be corrected from the platform, so that the language model is finetuned to fix the mistake and improve. The correction is used to generate a large set of simulated contexts in which it may appear, on which the language model is then finetuned;
- provide an interface where a paperless text can be supplied; if the paperless is detected, the model outputs a similarity score. This is an important asset for quality management. Additionally, a negative paperless version can be provided, so that the model also detects the negative version and outputs a similarity score to it, enabling a decision on whether the detected paperless is valid based on its similarity to the positive and the negative example.
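As a toy illustration of the similarity-scoring idea (the real system compares embeddings from the finetuned language model, not word counts, and the margin below is an arbitrary assumption), detection against a positive and a negative example might look like:

```python
from collections import Counter
import math

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two texts; a stand-in for
    the embedding-based similarity of a finetuned language model."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def script_detected(transcript, positive, negative, margin=0.1):
    """Flag the provided text as valid only when the transcript is closer
    to the positive example than to the negative one by at least `margin`."""
    return (cosine_similarity(transcript, positive)
            - cosine_similarity(transcript, negative)) >= margin
```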
RepsMate ASR Engine Results
On the unsupervised learning task, we have pretrained our model on 1000 hours of unlabeled audio call-center conversations. This unsupervised training task resulted in an acoustic model able to capture context representations rich in acoustic information, which will further be used on the downstream supervised task of semantic textual decoding.
For the supervised task, our labeling team produced 50 hours of annotated call-center conversations. For evaluation, we set aside 500 audio sentences from the labelled call-center dataset which were not used for training, and achieved a word error rate below 5% (over 95% word accuracy).
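Word error rate, the standard ASR evaluation metric, counts substitutions, deletions, and insertions relative to the reference transcript and can be computed with a word-level Levenshtein distance:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```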
We achieve a Pearson correlation coefficient of 0.83 on the STS (Semantic Textual Similarity) task with our finetuned language model, which is able to detect a wide range of key phrases and words.
On the NER (Named Entity Recognition) task, we achieve an F1-score of 85% with our finetuned transformer-based NER language model, which is able to recognize names, organizations, personal information, and numeric IDs.
Our ASR language model, used to improve ASR performance, has a perplexity of 4.2 on call-center-specific text. This model has been pretrained on a 16 GB corpus of publicly available internet text and finetuned on call-center-specific text.
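Perplexity, for reference, is the exponential of the average negative log-likelihood per token; a perplexity of 4.2 means the model is on average about as uncertain as a uniform choice among 4.2 tokens:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token);
    lower values mean the language model finds the text more predictable."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model assigning each token probability 1/4 has perplexity exactly 4.
```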
For domain adaptation, we require at most 2 hours of labelled audio recordings to boost our ASR accuracy; accuracy can also be improved by providing textual information alone.
For a savvy business, the implementation of Speech Recognition technology can bring great value. We list below five clear benefits of using ASR:
- Exceptional customer experience
- Reduced expenses
- New cross-selling and up-selling opportunities
- Minimized customer abandonment
- Increased productivity
We believe that ASR presents organizations with a great opportunity to gain more insights into their business, customers, and market growth.
For more information on how to get started with automatic speech recognition technology, explore our website.