Talk title: Matrix variate and tensor representation for voice conversion – from vector to matrix
Speaker: Daisuke Saito (Assistant Professor, the University of Tokyo)
In statistical speech processing, the 'vector' is one of the fundamental representations for features, e.g. MFCCs in automatic speech recognition (ASR), mel-cepstral coefficients in statistical parametric speech synthesis (SPSS), and so on. However, a single vector representation is limited in its capacity to describe the various kinds of information in speech. To compensate for this limitation, concatenation approaches are widely used to construct a more expressive 'vector', such as appending delta and acceleration parameters in ASR and SPSS, or the GMM supervector approach in speaker recognition. In this talk, another approach is discussed: the use of matrix or tensor representations rather than concatenated vectors. We show two uses of matrix or tensor representations in voice conversion studies: matrix variate Gaussian mixture models for voice conversion, and tensor factor analysis for speaker representation.
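For background, the matrix variate Gaussian replaces the vector mean and single covariance with a mean matrix and separate row and column covariances. The following is the standard matrix normal density; the talk's exact parameterization may differ:

```latex
% Matrix normal distribution for X \in \mathbb{R}^{n \times p},
% with mean matrix M, row covariance U (n x n), column covariance V (p x p):
p(X \mid M, U, V) =
\frac{\exp\!\left(-\tfrac{1}{2}\,
\mathrm{tr}\!\left[ V^{-1} (X - M)^{\mathsf{T}} U^{-1} (X - M) \right]\right)}
{(2\pi)^{np/2}\, |V|^{n/2}\, |U|^{p/2}}
```

The Kronecker-structured covariance (U for rows, V for columns) is what makes the matrix form more compact than a general Gaussian over the vectorized features.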
Talk title: Speech structure; human-inspired representation of speech acoustics — What are "really" speaker-independent features?
Speaker: Nobuaki Minematsu (Professor, the University of Tokyo)
Speech signals convey various kinds of information, which can be broadly grouped into two kinds, linguistic and extra-linguistic information. Many speech applications, however, focus on only a single aspect of speech. For example, speech recognizers try to extract only word identity from speech signals, and speaker recognizers extract only speaker identity. Here, irrelevant features are often treated as hidden or latent by applying probability theory to a large number of samples, or they are normalized to quasi-standard values. In speech analysis, however, phases are usually removed, not hidden or normalized, and pitch harmonics are likewise removed, not hidden or normalized. The resulting speech spectrum still contains both linguistic information (word identity) and extra-linguistic information (speaker identity). Is there any good method to remove extra-linguistic information from the spectrum? In this talk, our proposal of "really" speaker-independent or speaker-invariant features, called speech structure, is explained. Speaker variation can be modeled as a feature space transformation, and our speech structure model is based on the transform-invariance of f-divergence. This proposal was inspired by findings in classical studies of structural phonology and recent studies of developmental psychology. In this talk, we show how we technically implemented findings from phonology and psychology as a computational module, and we also show some examples of applying speech structure to speech applications. If time allows, we show some relationships between speech structure and DNN-based robust feature extraction.
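The transform-invariance of f-divergence mentioned above can be stated compactly. The following is a standard derivation sketch, not reproduced verbatim from the talk:

```latex
% f-divergence between densities p and q (f convex, f(1) = 0):
D_f(p \,\|\, q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx
% Under an invertible, differentiable transform y = T(x),
% both densities pick up the same Jacobian factor:
\tilde{p}(y) = p(x)\left|\det \tfrac{\partial T}{\partial x}\right|^{-1},
\qquad
\tilde{q}(y) = q(x)\left|\det \tfrac{\partial T}{\partial x}\right|^{-1}
% The Jacobian cancels in the ratio, so the divergence is invariant:
\frac{\tilde{p}(y)}{\tilde{q}(y)} = \frac{p(x)}{q(x)}
\;\;\Longrightarrow\;\;
D_f(\tilde{p} \,\|\, \tilde{q}) = D_f(p \,\|\, q)
```

Because any invertible feature-space transformation (the model of speaker variation above) leaves every pairwise f-divergence unchanged, a representation built from divergences between sound distributions is speaker-invariant by construction.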
Talk title: Divergence estimation based on deep neural networks and its use for language identification
Speaker: Yosuke Kashiwagi (PhD, the University of Tokyo)
Since statistical divergence is generally defined as a functional of two probability density functions, these density functions are usually represented in a parametric form.
Then, if a mismatch exists between the assumed distribution and the true one, the obtained divergence becomes erroneous. In our proposed method, using Bayes' theorem, the statistical divergence is estimated with a DNN serving as a discriminative estimation model.
In our method, the divergence between two distributions can be estimated without assuming a specific form for those distributions. When the amount of data available for estimation is small, however, it becomes intractable to calculate the integral of the divergence function over the entire feature space and to train neural networks. To mitigate this problem, two solutions are introduced: a model adaptation method for the DNN and a sampling approach for the integration. We apply this approach to language identification tasks, where the obtained divergences are used to extract a speech structure. Experimental results show that our approach improves language identification performance by 10.85% relative to the conventional i-vector-based approach.
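The Bayes'-theorem idea above can be illustrated with a minimal sketch: with equal class priors, a discriminative model's posterior ratio P(y=1|x)/P(y=0|x) equals the density ratio p(x)/q(x), so a Monte Carlo average of the classifier's logit over samples from p estimates KL(p||q) with no parametric assumption on p or q. The sketch below uses a plain logistic regression in place of the talk's DNN, and two toy 1-D Gaussians (all of this is our own illustrative setup, not the authors' experiment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: p = N(0, 1), q = N(2, 1).
# True KL(p || q) = (mu_p - mu_q)^2 / 2 = 2.0 for equal unit variances.
n = 5000
xp = rng.normal(0.0, 1.0, n)  # samples from p (label 1)
xq = rng.normal(2.0, 1.0, n)  # samples from q (label 0)

# Train a discriminative model P(y=1 | x) by gradient descent.
# With equal priors, Bayes' theorem gives
#   p(x) / q(x) = P(y=1 | x) / P(y=0 | x),
# i.e. the classifier's logit is an estimate of log p(x)/q(x).
x = np.concatenate([xp, xq])
y = np.concatenate([np.ones(n), np.zeros(n)])
w, b, lr = 0.0, 0.0, 0.5
for _ in range(3000):
    prob = 1.0 / (1.0 + np.exp(-(w * x + b)))  # sigmoid posterior
    w -= lr * np.mean((prob - y) * x)           # gradient step on w
    b -= lr * np.mean(prob - y)                 # gradient step on b

# Monte Carlo estimate of KL(p || q) = E_{x~p}[log p(x)/q(x)]:
# average the classifier logit over samples drawn from p.
kl_est = np.mean(w * xp + b)
print(f"estimated KL(p||q) = {kl_est:.2f}")
```

Here the optimal logit is exactly linear (log p/q = -2x + 2 for these Gaussians), so logistic regression suffices; for arbitrary distributions the same recipe works with a DNN posterior, which is the non-parametric property the abstract highlights.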