|Title: ||Improving the effectiveness of voice activation systems with machine learning methods|
|Other Titles: ||Balso aktyvavimo sistemų efektyvumo gerinimas naudojant mašininio mokymosi metodus|
|Authors: ||Kolesau, Aliaksei|
|Issue Date: ||18-Jul-2022|
|Publisher: ||Vilnius Gediminas Technical University|
|Citation: ||Kolesau, A. 2022. Improving the effectiveness of voice activation systems with machine learning methods: doctoral dissertation. Vilnius: Vilnius Gediminas Technical University, 158 p.|
|Abstract: ||Modern devices are increasingly equipped with voice control. Operating a device by speech is natural and efficient. However, incorporating voice control raises the question of how users should initiate a dialogue. One way is to process the whole audio stream. This solution raises privacy concerns and requires substantial computational and network resources. It also introduces the problem of distinguishing the user’s commands from irrelevant speech. The alternative is voice activation: initiating a voice dialogue by pronouncing a pre-defined keyword or keyphrase. The problem of finding such a pre-defined keyphrase in the audio stream is called “keyword spotting” and can be adequately solved with the computational resources available in modern embedded devices. This approach mitigates the privacy problems, requires fewer computational resources, and simplifies the detection of irrelevant speech: if the keyphrase is a priori rare, its utterance is a good signal that the user wants to start an interaction.
Because it is difficult to formulate an explicit algorithm for determining whether a keyphrase has been uttered in an audio stream, machine learning methods have long been used in voice activation systems. Most modern research in keyword spotting focuses on well-resourced setups and on proposing and tuning deep learning methods to improve detection quality.
This dissertation consists of an introduction, three main chapters, and general conclusions. The first chapter reviews existing research on voice activation systems, proposes a general structure for such a system, and introduces a Lithuanian dataset for training a voice activation system. The second chapter investigates the acoustic feature pipelines used in modern voice activation systems, studies the trend toward simpler acoustic feature extraction, and shows the importance of tuning pipeline hyperparameters for achieving the best results. It also discusses the acoustic units used in keyword spotting and how to use training samples effectively with these units. The chapter ends by proposing an approach that detects keyword repeats to improve the recall of a voice activation system. The third chapter presents new methods for building a voice activation system in a low-resource setup.
The experiments and analysis demonstrate that all main parts of a voice activation system (acoustic feature pipeline, acoustic model, and decoding) are important for achieving good detection quality. In addition, the proposed pre-training method for voice activation in a low-resource setup improves the accuracy of a voice activation system by 10% when the number of samples per keyword is seven or fewer and by 29% when the number of samples per keyword is five or fewer. Furthermore, the proposed joint training method improved the best accuracy on the Lithuanian dataset from 89.23% to 93.85%.|
|Description: ||Doctoral dissertation|
|Appears in Collections:||Technologijos mokslų daktaro disertacijos ir jų santraukos|