Voice or Noise

Nov. 6, 2017, by Dominik Schiller

An important part of many classification systems that work with a continuous input signal is splitting that signal into segments that are meaningful for the task at hand. For any voice-related task, such as speech recognition or emotion analysis from an audio signal, we want to identify the segments in which the user is actually speaking. To do so, we need a system that can automatically distinguish a human voice from any other noise present in the signal. Telling a voice apart from sounds that originate from other sources feels quite natural to humans. But like many things that are easy for us, deciding whether an input signal contains voice or noise is a non-trivial job for a computer. The task becomes even more challenging in KRISTINA once we consider the technical infrastructure of the project and the characteristics and restrictions that result from it.


Since KRISTINA lives in your web browser, a user can connect from anywhere in the world using the hardware of his or her choice. While this is a huge benefit in terms of usability and accessibility, it also means that signal quality will vary widely between sessions, since users have different microphones and each place has its own ambient noise level.

Given the target group of the project, we also aim to keep the technical knowledge required to run KRISTINA to an absolute minimum. This means that the user should not have to fine-tune system parameters to adapt it to his or her environment.

To keep the sensory equipment as unobtrusive as possible, using KRISTINA does not require the user to wear a headset. As a result, we have to take into account that everything KRISTINA says may be played back over speakers, picked up by the microphone, and forwarded into the system again. The decentralized architecture of the system makes it extremely challenging to implement echo cancellation that prevents KRISTINA from reacting to her own voice.

The current solution to these problems is a push-to-talk button, which forwards the audio signal to the next module only while it is held down. Although this simple mechanism works excellently at passing only the voiced parts of the audio to the system, it also makes the conversation feel less natural: the user always has to let KRISTINA know when meaningful information is coming by pressing a button, which is usually not the case in human-human communication.
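The gating behaviour of such a button can be sketched in a few lines. This is only an illustration under assumptions: the frame representation, the `push_to_talk_gate` name, and the idea that each audio frame arrives paired with the button state are all hypothetical and not taken from the KRISTINA code base.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical representation: one short chunk of audio samples.
Frame = List[float]


def push_to_talk_gate(frames: Iterable[Tuple[Frame, bool]]) -> Iterator[Frame]:
    """Yield only the frames that were captured while the button was held down."""
    for samples, button_pressed in frames:
        if button_pressed:
            yield samples


# Made-up stream of (audio frame, button state) pairs.
stream = [
    ([0.0, 0.1], False),  # ambient noise, button released
    ([0.4, 0.5], True),   # user speaks while holding the button
    ([0.3, 0.2], True),
    ([0.0, 0.0], False),  # button released again
]

forwarded = list(push_to_talk_gate(stream))
```

Only the two frames recorded while the button was pressed reach the next module; everything else, including KRISTINA's own voice coming back through the speakers, is dropped.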

To find a better solution for general, robust, and automatic voice activity detection, we are currently exploring ways to train a machine learning system that relies not only on cues from the audio signal but also takes the video into account. Intuitively this makes sense, since speaking involves movement of the lips and chin, which can be observed in the camera image and is not influenced by ambient sound. Indeed, studies have shown that taking visual features into account can significantly improve the quality of a voice activity detection system [1][2][3].
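To make the intuition concrete, here is a minimal sketch of combining an audio cue with a visual cue. It is deliberately not a trained model and not the method of the cited papers: it swaps in a hand-set rule that requires both enough audio energy and enough lip movement. The function names, the mouth-opening measurement, and both thresholds are assumptions chosen purely for illustration.

```python
import math
from typing import Sequence


def rms_energy(samples: Sequence[float]) -> float:
    """Root-mean-square energy of one audio window."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def lip_motion(mouth_openings: Sequence[float]) -> float:
    """Mean absolute change of a mouth-opening measurement across video frames."""
    if len(mouth_openings) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(mouth_openings, mouth_openings[1:])]
    return sum(diffs) / len(diffs)


def is_speaking(
    samples: Sequence[float],
    mouth_openings: Sequence[float],
    audio_threshold: float = 0.05,   # hypothetical value
    motion_threshold: float = 0.02,  # hypothetical value
) -> bool:
    """Report speech only if the audio is loud enough AND the lips are moving."""
    loud_enough = rms_energy(samples) >= audio_threshold
    lips_moving = lip_motion(mouth_openings) >= motion_threshold
    return loud_enough and lips_moving


loud_audio = [0.2, -0.2, 0.3, -0.3]
quiet_audio = [0.001, -0.001]
moving_lips = [0.1, 0.2, 0.1, 0.25]
still_lips = [0.1, 0.1, 0.1]

user_talking = is_speaking(loud_audio, moving_lips)
speaker_playback = is_speaking(loud_audio, still_lips)
silent_mouthing = is_speaking(quiet_audio, moving_lips)
```

With this rule, loud audio without lip movement (for example ambient noise or KRISTINA's own voice from the speakers) is rejected, which is the kind of case that audio alone cannot resolve. A learned classifier would replace the fixed thresholds with parameters estimated from data.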


[1] M. Burlick, D. Dimitriadis, and E. Zavesky, “On the improvement of multimodal voice activity detection,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2013, pp. 685–689.

[2] D. Dov, R. Talmon, and I. Cohen, “Audio-Visual Voice Activity Detection Using Diffusion Maps,” IEEE/ACM Trans. Audio, Speech Lang. Process., vol. 23, no. 4, pp. 732–745, 2015.

[3] Y. Abe and A. Ito, “Multi-modal Voice Activity Detection by Embedding Image Features into Speech Signal,” in 2013 Ninth Int. Conf. Intell. Inf. Hiding Multimed. Signal Process., Oct. 2013, pp. 271–274.
