Abstract: Today, massive amounts of audiovisual content are being generated, stored, released and delivered, due in part to virtually unlimited storage capacity, widespread access to the means of producing such content by anybody, anywhere, and the ubiquitous connectivity provided by the Internet. In this context, suitable, affordable and sustainable content management that enables searching for and retrieving information of interest is a must. Since manually handling such amounts of data is intractable, speech processing techniques can play a crucial role in the automatic tagging and annotation of audiovisual content. Speaker diarization (also known as the "who spoke when" task) has become a key supporting technology for further speech processing systems, such as automatic speech recognition and automatic speaker recognition, used for the automatic extraction of metadata from spoken documents.

Among the massive amount of audiovisual content being created, there can be recurrent speakers who participate in several sessions within a collection. In TV and radio content, for instance, one frequently finds recurrent speakers such as public figures, journalists, presenters and anchors. Because current speaker diarization technology is local in nature (systems work on a single-session basis), an arbitrary recurrent speaker will likely receive different local abstract identifiers across the sessions in which he or she participates. It would be more meaningful for recurrent speakers to receive the same global abstract identifier across all sessions. This task is known as cross-session speaker diarization; a minimal illustration of the linking idea is sketched at the end of this abstract.

Current state-of-the-art speaker diarization systems achieve very good performance, but usually at the cost of long processing times. This limitation on execution time makes current systems unsuitable for large-scale, real-life applications, and it becomes even more evident in cross-session speaker diarization. In this thesis, the fast speaker diarization approach based on binary key speaker modeling is taken to the next level, with the aim of bringing it closer to state-of-the-art performance while preserving the high processing speed that allows large audio collections to be processed in competitive times. Furthermore, a new cross-session speaker diarization system based on binary key speaker modeling is proposed, following the same goals: competitive performance with short execution times.

As a result of this thesis, we propose a new, improved single-session speaker diarization system which exhibits a 16% relative improvement over a baseline binary key system (15.15% DER versus 18.22% DER, where DER is the diarization error rate), while being 7 times faster (0.035 xRT versus 0.252 xRT, where xRT is the real-time factor) and 28 times faster than real time; both metrics are defined below. For cross-session speaker diarization, we propose a binary key system whose performance is only 3.5% absolute DER below that of its single-session counterpart, while running at a real-time factor of 0.036 xRT. Furthermore, our approach has been shown to scale successfully to the processing of audio collections of several hundred hours.
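For reference, the two figures of merit quoted above have standard definitions, sketched here under the assumption that the thesis follows the usual NIST-style conventions: DER is the fraction of reference speech time that is wrongly attributed (missed speech, false-alarm speech and speaker confusion), and xRT divides processing time by audio duration.

```latex
\mathrm{DER} = \frac{T_{\mathrm{miss}} + T_{\mathrm{FA}} + T_{\mathrm{conf}}}{T_{\mathrm{speech}}}
\qquad
\mathrm{xRT} = \frac{T_{\mathrm{processing}}}{T_{\mathrm{audio}}}
```

As a worked example, 0.035 xRT means that one hour of audio is processed in about 0.035 × 60 ≈ 2.1 minutes, i.e. roughly 1/0.035 ≈ 28.6 times faster than real time, consistent with the "28 times faster than real time" figure above.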
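The abstract does not detail how local speaker labels are linked into global ones, so the following is only a minimal Python sketch of the cross-session linking task, not the method of the thesis. It assumes each per-session speaker is summarized as a fixed-length binary key (in the binary key literature, such keys are typically derived from a background model of Gaussian components) and links speakers greedily by Jaccard similarity; the function names, the greedy strategy and the 0.5 threshold are all hypothetical choices for illustration.

```python
import numpy as np

def jaccard(a: np.ndarray, b: np.ndarray) -> float:
    # Jaccard similarity between two boolean binary-key vectors.
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)

def link_speakers(session_keys, threshold=0.5):
    # session_keys: list of (session_id, local_id, binary_key) triples.
    # Returns a mapping (session_id, local_id) -> global speaker ID.
    global_models = []   # one representative binary key per global speaker
    assignments = {}
    for sess, local, key in session_keys:
        sims = [jaccard(key, m) for m in global_models]
        if sims and max(sims) >= threshold:
            gid = int(np.argmax(sims))   # reuse the closest global speaker
        else:
            gid = len(global_models)     # open a new global speaker
            global_models.append(key)
        assignments[(sess, local)] = gid
    return assignments

# Usage: two sessions whose local speakers share similar keys are merged.
keys = [
    ("session1", "spk0", np.array([1, 1, 0, 1], dtype=bool)),
    ("session2", "spk0", np.array([1, 1, 0, 0], dtype=bool)),
]
print(link_speakers(keys))
# {('session1', 'spk0'): 0, ('session2', 'spk0'): 0}
```

In this toy run the two keys have Jaccard similarity 2/3 ≥ 0.5, so both local speakers receive the same global ID, which is exactly the behavior cross-session speaker diarization aims for with recurrent speakers.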