Latent Semantic Analysis (LSA) Tutorial: An Introduction to Latent Semantic Analysis (5)


WangBen 20110916 Beijing
Part 3 - Using the Singular Value Decomposition
Once we have built our (words by titles) matrix, we call upon a powerful but little-known technique called Singular Value Decomposition, or SVD, to analyze the matrix for us. The "Singular Value Decomposition Tutorial" is a gentle introduction for readers who want to learn more about this powerful and useful algorithm.
The reason SVD is useful is that it finds a reduced-dimensional representation of our matrix that emphasizes the strongest relationships and throws away the noise. (Translator's note: this algorithm is also often used for image compression.) In other words, it makes the best possible reconstruction of the matrix with the least possible information. To do this, it throws out noise, which does not help, and emphasizes strong patterns and trends, which do help. The trick in using SVD is figuring out how many dimensions, or "concepts", to use when approximating the matrix. With too few dimensions, important patterns are left out; with too many, noise caused by random word choices creeps back in.
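The trade-off between too few and too many dimensions can be seen on a small example. The following is a minimal sketch, assuming NumPy is available; the helper name `low_rank_approximation` and the toy matrix are made up for illustration and are not part of the tutorial's LSA class.

```python
import numpy as np

def low_rank_approximation(A, k):
    """Rebuild A from only its k largest singular values (hypothetical helper)."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Hypothetical toy count matrix (rows = words, columns = titles).
A = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
])

# The reconstruction error never grows as more dimensions are kept.
errors = [np.linalg.norm(A - low_rank_approximation(A, k)) for k in range(1, 5)]
for k, err in zip(range(1, 5), errors):
    print(f"k={k}: reconstruction error {err:.3f}")
```

Keeping every dimension reproduces the matrix exactly, noise included, which is why LSA deliberately stops at a smaller k.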
The SVD algorithm is a little involved, but fortunately Python has a library function that makes it simple to use. (Translator's note: scipy can be difficult to install.) By adding the one-line method below to our LSA class, we can factor our matrix into 3 other matrices. The U matrix gives us the coordinates of each word in our "concept" space, the Vt matrix gives us the coordinates of each document in our "concept" space, and the S matrix of singular values gives us a clue as to how many dimensions or "concepts" we need to include. (Translator's note: why S provides this clue deserves further study.)

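    def calc(self):
        # Factor the (words by titles) matrix A into U, S (singular values) and Vt.
        # Assumes svd has been imported at module level, e.g. from scipy.linalg.
        self.U, self.S, self.Vt = svd(self.A)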
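To see what the three factors look like, here is a self-contained sketch. `TinyLSA` is a hypothetical stand-in for the tutorial's LSA class that only stores the matrix, the counts are made up, and `svd` is taken from `numpy.linalg` here as an assumption (`scipy.linalg` offers a function of the same name).

```python
import numpy as np
from numpy.linalg import svd

class TinyLSA:
    """Hypothetical stand-in for the tutorial's LSA class; it only holds A."""

    def __init__(self, A):
        self.A = np.asarray(A, dtype=float)

    def calc(self):
        # full_matrices=False keeps only min(words, titles) dimensions.
        self.U, self.S, self.Vt = svd(self.A, full_matrices=False)

# 4 words x 3 titles, made-up counts.
lsa = TinyLSA([[1, 0, 1],
               [0, 1, 1],
               [1, 1, 0],
               [0, 0, 1]])
lsa.calc()
print(lsa.U.shape, lsa.S.shape, lsa.Vt.shape)  # (4, 3) (3,) (3, 3)
```

Note that S comes back as a one-dimensional array of singular values sorted from largest to smallest, not as a full diagonal matrix.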

In order to choose the right number of dimensions to use, we can make a histogram of the squares of the singular values. This graphs the importance each singular value contributes to approximating our matrix. Here is the histogram in our example. (Translator's note: the square of each singular value represents its importance; the figure shows the values after normalization.)
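The normalized importance values behind such a histogram could be computed roughly as follows. This is a sketch under stated assumptions: the function name `singular_value_importance` is invented, the singular values are hypothetical rather than the tutorial's, and the bars are printed as text instead of being plotted.

```python
import numpy as np

def singular_value_importance(S):
    """Square the singular values and normalize so the shares sum to 1."""
    squared = np.asarray(S, dtype=float) ** 2
    return squared / squared.sum()

# Hypothetical singular values, sorted from largest to smallest.
S = [4.0, 2.0, 2.0, 1.0]
shares = singular_value_importance(S)

# Print a crude text histogram, one bar per dimension.
for i, share in enumerate(shares, start=1):
    print(f"dimension {i}: {share:.2f} " + "#" * int(round(share * 40)))
```

A sharp drop between consecutive bars suggests that the dimensions after the drop contribute little and can be discarded.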
For large collections of documents, the number of dimensions used is in the 100 to 500 range. In our little example, since we want to graph it, we'll use 3 dimensions, throw out the first dimension, and graph the second and third dimensions.
The reason we throw out the first dimension is interesting. For documents, the first dimension correlates with the length of the document. For words, it correlates with the number of times that word has been used in all documents. If we had centered our matrix, by subtracting the average column value from each column, then we would use the first dimension. As an analogy, consider golf scores. We don't want to know the actual score; we want to know the score relative to par. That tells us whether the player made a birdie, bogey, etc. (Translator's note: why these correlations hold, and why centering would make the first dimension usable, is left as an open question.)
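The correlation between the first dimension and raw frequency can be checked numerically. The sketch below uses synthetic Poisson counts, which is an assumption for illustration only and not the tutorial's book titles: common words and long documents are simply given higher rates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption): 50 hypothetical words, 8 hypothetical documents.
word_rate = np.linspace(0.5, 2.0, 50)     # higher rate = more common word
doc_length = np.linspace(0.2, 3.0, 8)     # higher rate = longer document
A = rng.poisson(np.outer(word_rate, doc_length)).astype(float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Singular vectors are only defined up to sign, so flip them to point positive.
u1 = U[:, 0] * np.sign(U[:, 0].sum())
v1 = Vt[0] * np.sign(Vt[0].sum())

word_corr = np.corrcoef(u1, A.sum(axis=1))[0, 1]
doc_corr = np.corrcoef(v1, A.sum(axis=0))[0, 1]
print(f"first word dimension vs. total word count: {word_corr:.2f}")
print(f"first doc dimension vs. document length:   {doc_corr:.2f}")
```

On uncentered, non-negative counts, the first pair of singular vectors mostly captures overall size, which is why it says little about topic.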
The reason we don't center the matrix when using LSA is that we would turn a sparse matrix into a dense matrix and dramatically increase the memory and computation requirements. It's more efficient to not center the matrix and then throw out the first dimension.
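The loss of sparsity is easy to demonstrate. The following sketch uses a made-up count matrix and counts nonzero entries before and after subtracting each column's mean; a real LSA pipeline would hold the matrix in a sparse format, which this dense NumPy example only imitates.

```python
import numpy as np

# Hypothetical sparse count matrix: 6 words x 4 titles, mostly zeros.
A = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 2, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
], dtype=float)

# Center by subtracting each column's mean from that column.
centered = A - A.mean(axis=0)

nonzero_before = np.count_nonzero(A)
nonzero_after = np.count_nonzero(centered)
print("nonzeros before centering:", nonzero_before)
print("nonzeros after centering: ", nonzero_after)
```

Every zero entry becomes the negative of its column mean, so a sparse representation would have to store all 24 cells instead of 6.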
Here is the complete 3-dimensional Singular Value Decomposition of our matrix. Each word has 3 numbers associated with it, one for each dimension. The first number tends to correspond to the number of times that word appears in all titles and is not as informative as the second and third dimensions, as we discussed. Similarly, each title also has 3 numbers associated with it, one for each dimension. Once again, the first dimension is not very interesting because it tends to correspond to the number of words in the title.

Words (one row per word, one column per dimension):

    book         0.15  -0.27   0.04
    dads         0.24   0.38  -0.09
    dummies      0.13  -0.17   0.07
    estate       0.18   0.19   0.45
    guide        0.22   0.09  -0.46
    investing    0.74  -0.21   0.21
    market       0.18  -0.30  -0.28
    real         0.18   0.19   0.45
    rich         0.36   0.59  -0.34
    stock        0.25  -0.42  -0.28
    value        0.12  -0.14   0.23

*

Singular values:

    3.91   0      0
    0      2.61   0
    0      0      2.00

*

Titles (one row per dimension, one column per title):

       T1     T2     T3     T4     T5     T6     T7     T8     T9
     0.35   0.22   0.34   0.26   0.22   0.49   0.28   0.29   0.44
    -0.32  -0.15  -0.46  -0.24  -0.14   0.55   0.07  -0.31   0.44
    -0.41   0.14  -0.16   0.25   0.22  -0.51   0.55   0.00   0.34
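The numbers above can be loaded into arrays to pull out the coordinates that would be graphed. This is a sketch only: the values are copied from the decomposition shown above (rounded to two decimals), and the names `word_xy`, `title_xy`, and `A3` are introduced here for illustration.

```python
import numpy as np

words = ["book", "dads", "dummies", "estate", "guide", "investing",
         "market", "real", "rich", "stock", "value"]
titles = [f"T{i}" for i in range(1, 10)]

# Word coordinates, one row per word (values copied from the table above).
U = np.array([
    [0.15, -0.27,  0.04], [0.24,  0.38, -0.09], [0.13, -0.17,  0.07],
    [0.18,  0.19,  0.45], [0.22,  0.09, -0.46], [0.74, -0.21,  0.21],
    [0.18, -0.30, -0.28], [0.18,  0.19,  0.45], [0.36,  0.59, -0.34],
    [0.25, -0.42, -0.28], [0.12, -0.14,  0.23],
])
S = np.array([3.91, 2.61, 2.00])
# Title coordinates, one row per dimension.
Vt = np.array([
    [ 0.35,  0.22,  0.34,  0.26,  0.22,  0.49,  0.28,  0.29,  0.44],
    [-0.32, -0.15, -0.46, -0.24, -0.14,  0.55,  0.07, -0.31,  0.44],
    [-0.41,  0.14, -0.16,  0.25,  0.22, -0.51,  0.55,  0.00,  0.34],
])

# Drop the first dimension; dimensions 2 and 3 become the x and y coordinates.
word_xy = dict(zip(words, U[:, 1:3]))
title_xy = dict(zip(titles, Vt[1:3, :].T))

# Rank-3 approximation of the original (words by titles) matrix.
A3 = U @ np.diag(S) @ Vt

for word, (x, y) in word_xy.items():
    print(f"{word:10s} x={x:+.2f} y={y:+.2f}")
```

In this table "real" and "estate" have identical coordinates, so the two words would land on the same point of the graph.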


