Non-negative matrix factorization
#
Find similar titles
- 최초 작성자
- 최근 업데이트
Structured data
- Category
- Algorithm
Non-negative Matrix Factorization (NMF)는 선형 대수에서 사용하는 decomposition법의 하나로, 컴퓨터 비전, 텍스트마이닝, 생물정보학에 사용된다. 1970년대 초에 제안되어 1999년도에서야 광범하게 사용될 만한 알고리즘이 나왔다(^1). Lu(^2)의 글을 읽어보면 PCA와 비교하여 차이점을 잘 설명하고 있다. 비구조화된 자료에서 숨은 구조와 의미를 찾는 데이터 과학의 기법중 하나이다.
강병철은 PubMed의 초록을 텍스트 마이닝하기 위해서 파이썬 프로그램을 개발하였다(^3). 이를 자세히 설명하면, 표준 NMF 알고리즘은 \(n \times m\) 차원의 행렬 \(V\)를 \(WH\) 행렬로 인수분해하는 것이다. 이때 \(W\)는 \(n \times r\), \(H\)는 \(r \times m\)의 차원을 가지고 \(r < n\), \(r < m\)의 조건을 만족한다. 인수분해를 하는 것은 다음의 거리를 최소화하는 것을 찾는 것으로 실현된다.
$$ E(V-WH)^2 = \sum_{ij} (V_{ij} - (WH)_{ij} )^2 $$ $$ D(V||WH) = \sum_{ij} (V_{ij} \log( \frac{ V_{ij} }{ (WH)_{ij} })-V_{ij} + (WH)_{ij} ) $$
Cost 함수라 불리는 위 함수를 최소화하기 위해서는 초기값에서부터 \(W\)와 \(H\)에 대해서 다음의 규칙을 반복적으로 계산한다.
* Standard NMF Algorithms aim at factorizing an n*m matrix V into the product W*H of an n*r matrix (W) and an r*m matrix (H) with r<n and r<m. This is done by minimizing one of the following distances:
* E(V-WH)^2^ = Sum,,ij,,(V,,ij,,-(WH),,ij,,)^2^
* D(V||WH) = Sum,,ij,,(V,,ij,,log(V,,ij,,/(WH),,ij,,)-V,,ij,,+(WH),,ij,,)
* In order to minimize these so-called cost functions, iterative update rules for W and H have been developed in the beginning of this decade. The update rules are as follows:
* H,,as,, <- H,,as,,((W^T^V),,as,,/(W^T^WH),,as,,) and W,,ia,, <- W,,ia,,((VH^T^),,ia,,/(WHH^T^),,ia,,)
* H,,as,, <- H,,as,,(((Sum,,i,,(W,,ia,,V,,is,,))/((WH),,is,,))/Sum,,k,,(W,,ka,,))) and W,,ia,, <- W,,ia,,(((Sum,,s,,(H,,as,,V,,is,,)/(WH),,is,,))/(Sum,,t,,(H,,at,,))))
* This method can be used in biomedical data mining because it discovers latent semantic features. To this end, NMF recognizes local structures and takes them into account when factorizing the original matrix. For instance, this can be taken advantage of in document clustering of life science articles, where an application example of this unsupervised process takes a list of MEDLINE abstracts as input, defines keywords out of the abstracts of the papers and groups the documents into a certain number (r) of clusters of which each one is determined by certain keywords that represent the cluster's basis.
* The new developed NMF algorithm for biological and medical data mining shall help to improve the choice of the elements of the cluster bases. It is easy to see that the choice is essential first for the biological quality of the clusters and second for the clustering of the data itself. In the application example, the basis elements are the keywords.
* For this purpose, an m*m diagonal matrix M is generated, which contains weights according to the list of basis elements, i.e. M,,jj,, is the weight belonging to the j^th^ element. Clearly, a multiplication by V from the right side generates weighted columns of V, i.e. V*M = M,,ij,,V,,ij,, for all i,j and M,,ij,, = 0 for i<>j (i different from j).
* Consequently, the according cost functions that have to be minimized are:
* E,,M,,(V-WH)^2^ = Sum,,ij,,(M,,ij,,(V,,ij,,(WH),,ij,,)^2^)
* D(V||WH) = Sum,,ij,,(M,,ij,,(V,,ij,,log(V,,ij,,/(WH),,ij,,)-V,,ij,,+(WH),,ij,,))
* This can be done by implementing the following update rules:
* H,,as,, <- H,,as,,(Sum,,i,,(W^T^,,ai,,V,,is,,M,,is,,)/Sum,,i,,(W^T^,,ai,,(WH),,is,,M,,is,,)) and W,,ia,, <- W,,ia,,(Sum,,k,,(V,,ik,,M,,ik,,H^T^,,ka,,)/Sum,,k,,((WH),,ik,,M,,ik,,H^T^,,ka,,))
* H,,as,, <- H,,as,,((Sum,,i,,(W,,ia,,V,,is,,M,,is,,)/((WH),,is,,))/Sum,,k,,(M,,ka,,W,,ka,,)) and W,,ia,, <- W,,ia,,(((Sum,,s,,(H,,as,,V,,is,,M,,is,,)/(WH),,is,,))/(Sum,,t,,(H,,at,,M,,ta,,))))
삼국시대 이후에 제작된 도기와 토기의 산지를 추정하기 위해 적용한 사례가 있다(^4).
전철욱은 NMF, k-means, LDA, LSI를 이용하여 전현직 대통령의 연설문을 클러스터링하였다^5.