Recently, speaker diarization based on speaker embeddings has shown excellent results in many works. In this paper we propose several enhancements throughout the diarization pipeline. This work addresses two clustering frameworks: agglomerative hierarchical clustering (AHC) and spectral clustering (SC).
First, we use multiple speaker embeddings. We show that fusing x-vectors and d-vectors significantly improves diarization accuracy.
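The abstract does not specify the fusion scheme; a minimal sketch, assuming a simple embedding-level fusion in which each embedding is length-normalized and the two are concatenated (the dimensions and arrays below are hypothetical):

```python
import numpy as np

def fuse_embeddings(x_vec, d_vec):
    """Fuse two speaker embeddings by length-normalizing each and
    concatenating them. This is one simple fusion strategy; the
    paper's exact scheme may differ (e.g., score-level fusion)."""
    x = x_vec / np.linalg.norm(x_vec)
    d = d_vec / np.linalg.norm(d_vec)
    return np.concatenate([x, d])

# Hypothetical per-segment embeddings: 512-dim x-vector, 256-dim d-vector.
rng = np.random.default_rng(0)
x_vec = rng.normal(size=512)
d_vec = rng.normal(size=256)
fused = fuse_embeddings(x_vec, d_vec)
print(fused.shape)  # (768,)
```

Length-normalizing before concatenation keeps either embedding from dominating the fused representation purely because of its scale.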
Second, we train neural networks to leverage both acoustic and duration information for scoring the similarity of segments or clusters. Third, we introduce a novel method that guides the AHC clustering mechanism with a neural network. Fourth, we handle short-duration segments in SC by deemphasizing their effect on setting the number of speakers.
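To make the AHC framework concrete, here is a minimal sketch of agglomerative clustering over segment embeddings, using plain cosine distance as a stand-in for the paper's neural similarity scorer (the 2-D embeddings and threshold are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def ahc_labels(embeddings, threshold):
    # Average-linkage AHC on pairwise cosine distances between segment
    # embeddings; merging stops once all inter-cluster distances exceed
    # the threshold. A neural scorer would replace the cosine metric.
    dists = pdist(embeddings, metric="cosine")
    Z = linkage(dists, method="average")
    return fcluster(Z, t=threshold, criterion="distance")

rng = np.random.default_rng(1)
# Two hypothetical speakers: embeddings scattered around two directions.
a = rng.normal(loc=[3.0, 0.0], scale=0.1, size=(5, 2))
b = rng.normal(loc=[0.0, 3.0], scale=0.1, size=(5, 2))
labels = ahc_labels(np.vstack([a, b]), threshold=0.5)
print(labels)
```

The stopping threshold plays the role of the clustering criterion that the paper's neural guidance would refine.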
Finally, we propose a novel method for estimating the number of clusters in the SC framework. For each eigenvalue, the method analyzes the projection of the SC similarity matrix onto the corresponding eigenvector.
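For context, the standard baseline that such a method improves upon is the eigengap heuristic: pick the number of speakers at the largest gap in the sorted Laplacian eigenvalues. A minimal sketch of that baseline (the projection-based criterion itself is not specified in this abstract; the block-diagonal affinity matrix below is a toy example):

```python
import numpy as np

def estimate_num_speakers(affinity, max_speakers=10):
    """Estimate the cluster count from a symmetric SC affinity matrix via
    the largest eigengap of the normalized graph Laplacian -- a common
    baseline, not the paper's projection-based criterion."""
    # Normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    deg = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-10))
    lap = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(lap))
    # Largest gap between consecutive small eigenvalues sets the count.
    gaps = np.diff(eigvals[: max_speakers + 1])
    return int(np.argmax(gaps)) + 1

# Toy block-diagonal affinity: two perfectly separated "speakers".
A = np.kron(np.eye(2), np.ones((4, 4)))
print(estimate_num_speakers(A))  # 2
```

On real diarization affinities the small eigenvalues are noisy, which is precisely why short, unreliable segments benefit from being deemphasized when setting the speaker count.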
We evaluated our system on the NIST SRE 2000 CALLHOME corpus and, using cross-validation, achieved an error rate of 5.1%, surpassing the state of the art in speaker diarization.