Description
This research focuses on the application of matrix factorization methods to blind source separation, i.e. discovering meaningful connections or subsets of variables in a large amount of data (measurements) with an originally large number of observed variables (parameters). Every meaningful subset of such variables can be treated as a blind source. This is useful in situations where a small number of sources, usually much smaller than the number of observed variables, describes the observed situation well. Partitioning the variables into sources helps us better understand the subsystems that affect the observed data. In the literature, blind source separation has been used in face recognition [1] [2], identification of handwritten digits [3], document clustering [4], discovery of blind pollution sources [5] [6] [7] [8] [9] [10], and speech recognition [11]. What these fields have in common is that the measured or observed variables are positive and that the applied methods are matrix factorizations. We describe three matrix factorization methods and their application in blind source separation: principal component analysis/factor analysis (PCA-FA), nonnegative matrix factorization (NMF), and positive matrix factorization (PMF). PMF can be seen as an extension of NMF that uses not only the measured data but also the uncertainty of the measurements. The first part presents the mathematical formulations of the methods and algorithms, with emphasis on the differences between them. The second part presents an application of the methods to two datasets: a simulated dataset and a dataset of real environmental measurements that is supplied with the EPA PMF application [12] [13].
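To make the difference between NMF and PMF concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes the common notation X ≈ GF, where X is the nonnegative data matrix, G holds the source contributions and F the source profiles; the function names, the multiplicative update rule, and the constant uncertainty matrix sigma are illustrative assumptions. The PMF-style objective divides each residual by its measurement uncertainty, whereas plain NMF treats all residuals equally. The sketch does not reproduce the Matlab nnmf function or the EPA PMF solver.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=500, eps=1e-9, seed=0):
    """Approximate X (n x m, nonnegative) as G @ F with G, F >= 0.

    Uses multiplicative updates for the unweighted squared error;
    this is an illustrative choice of algorithm, not the one from the abstract.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    G = rng.random((n, k)) + eps  # source contributions (hypothetical name)
    F = rng.random((k, m)) + eps  # source profiles (hypothetical name)
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so signs never flip.
        F *= (G.T @ X) / (G.T @ G @ F + eps)
        G *= (X @ F.T) / (G @ F @ F.T + eps)
    return G, F

def pmf_objective(X, G, F, sigma):
    """PMF-style objective: squared residuals scaled by the uncertainties sigma."""
    return float(np.sum(((X - G @ F) / sigma) ** 2))

# Small synthetic example: 50 samples, 8 variables, 3 underlying sources.
rng = np.random.default_rng(1)
X = rng.random((50, 3)) @ rng.random((3, 8))
G, F = nmf_multiplicative(X, k=3)

sigma = np.full_like(X, 0.1)  # assumed constant uncertainty, for illustration only
Q = pmf_objective(X, G, F, sigma)
print("relative error:", np.linalg.norm(X - G @ F) / np.linalg.norm(X))
print("weighted objective Q:", Q)
```

With sigma set to all ones, the weighted objective reduces to the ordinary squared error that NMF minimizes, which is one way to see PMF as an extension of NMF.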
Simulated data [14] are of utmost importance in such an analysis, because in a simulated dataset the structure of the data is known, which is crucial for comparing the methods. Repeatedly simulating datasets and decomposing them makes it possible to analyze statistically the differences between the methods across all simulated datasets. With the real dataset, repetitions are not possible, and consequently neither is such a statistical analysis. In addition, selecting the correct number of sources requires additional knowledge of the subsystems behind the observed data. For the practical decomposition of the data matrix, the FApca and nnmf methods in Matlab were used; for the PMF method, the EPA PMF program was used. Further research will focus on combining the methods, with the goal of better identifying the sources, and on improving the objective functions.
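The repeated-simulation idea can be sketched as follows. This is a hypothetical protocol, not the simulation design of [14]: the data generator, the noise level, the correlation-based recovery score, and all function names are assumptions made for illustration. A truncated SVD of the centered data stands in for a PCA-like decomposition; any other factorization that returns source profiles could be plugged in at the same place.

```python
import numpy as np

def simulate(n_samples, n_vars, n_sources, noise, rng):
    """Generate nonnegative data with known source profiles F (assumed design)."""
    G = rng.random((n_samples, n_sources))  # true source contributions
    F = rng.random((n_sources, n_vars))     # true source profiles
    X = G @ F + noise * rng.standard_normal((n_samples, n_vars))
    return np.clip(X, 0.0, None), F         # clip so the data stay nonnegative

def svd_profiles(X, k):
    """PCA-like stand-in: top-k right singular vectors of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]

def recovery_score(F_true, F_est):
    """Mean over true profiles of the best absolute correlation with any estimate."""
    k = F_true.shape[0]
    C = np.abs(np.corrcoef(F_true, F_est)[:k, k:])  # true-vs-estimated block
    return float(C.max(axis=1).mean())

# Repeat the simulation with different seeds and collect one score per run.
scores = []
for seed in range(20):
    rng = np.random.default_rng(seed)
    X, F_true = simulate(n_samples=200, n_vars=8, n_sources=3, noise=0.05, rng=rng)
    scores.append(recovery_score(F_true, svd_profiles(X, k=3)))

scores = np.array(scores)
print("mean recovery:", scores.mean(), "std:", scores.std(ddof=1))
```

Because the true profiles are known in every run, the spread of the scores across repetitions gives a basis for comparing methods statistically, which is not available for a single real dataset.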
Primary authors
Dr
Janez Žibert
(University of Primorska)
Mr
Pavel Fičur
(University of Primorska)