Mahout: CVB

Running cvb fails with the following error:
org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
Solution:
The new LDA (CVB) requires SequenceFile<IntWritable, VectorWritable> as input 
(the same on-disk format as DistributedRowMatrix). You can produce it from a 
SequenceFile<Text, VectorWritable> by running RowIdJob before running CVB 
(see "$MAHOUT_HOME/bin/mahout rowid -h" for details).
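The two steps can be sketched as a small shell function. This is only a sketch under assumptions: the tf-vectors and rowid directories, the lda/tmp directory, and the -k and -x values are hypothetical and not taken from the original run; only the doc-topic, topic-term and dictionary.file-0 locations match the commands used later in this post. Paths are written relative to the HDFS home directory.

```shell
# Override MAHOUT (for example MAHOUT=echo) to preview the commands without running them.
MAHOUT="${MAHOUT:-mahout}"

# Hypothetical HDFS layout; adjust to your own cluster.
BASE="mahout/topics"
VECTORS="$BASE/docsvectors3/tf-vectors"   # assumed SequenceFile<Text, VectorWritable> input
ROWID_OUT="$BASE/docsvectors3/rowid"      # assumed output directory for rowid

run_rowid_then_cvb() {
  # Step 1: re-key the vectors from Text to IntWritable.
  # rowid writes $ROWID_OUT/matrix and $ROWID_OUT/docIndex.
  "$MAHOUT" rowid -i "$VECTORS" -o "$ROWID_OUT" || return 1

  # Step 2: run CVB on the re-keyed matrix.
  # -k (number of topics) and -x (max iterations) are illustrative values.
  "$MAHOUT" cvb \
    -i "$ROWID_OUT/matrix" \
    -o "$BASE/lda/topic-term" \
    -dt "$BASE/lda/doc-topic" \
    -mt "$BASE/lda/tmp" \
    -dict "$BASE/docsvectors3/dictionary.file-0" \
    -k 3 -x 20 -ow
}
```

Setting MAHOUT=echo and calling run_rowid_then_cvb prints both command lines, which is a cheap way to check the paths first. The docIndex file written by rowid maps the new integer ids back to the original document keys, so it is worth keeping.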

 
 
 
Interpreting the results
doc-topic
mahout vectordump \
  -i hdfs://192.168.122.1:2014/user/zhaohj/mahout/topics/lda/doc-topic \
  -o data/lda/doc-topic \
  -sort true -vs 1 -p true

 Note: combined with -sort true, -vs 1 dumps only the highest-probability topic for each document, for example:
#doc-index    topic-id:probability
0	      {1:0.9999999918613426}
1	      {2:0.999999958633294}
2	      {0:0.9999999872590848}
3	      {0:0.9999999914501596}

 Warning: do not pass the -d option when dumping doc-topic. -d maps vector indices to dictionary terms, but here the indices are topic ids, so the output would be meaningless.
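Once doc-topic is dumped with -vs 1, a quick way to see how the documents spread over the topics is to count lines per topic id. The function below is a minimal sketch; count_docs_per_topic is a made-up helper name, and it assumes every data line has the form shown above (doc index, whitespace, then {topic-id:probability}).

```shell
# Count how many documents have each topic as their top topic.
# $1: local file written by "mahout vectordump ... -vs 1 -p true"
count_docs_per_topic() {
  awk '
    /^#/ || NF < 2 { next }        # skip comment lines and blank lines
    {
      gsub(/[{}]/, "", $2)         # strip the braces around topic:probability
      split($2, pair, ":")         # pair[1] is the topic id
      count[pair[1]]++
    }
    END { for (t in count) print t, count[t] }
  ' "$1" | sort -n
}
```

For example, count_docs_per_topic data/lda/doc-topic on the four sample lines above prints "0 2", "1 1" and "2 1", one pair per line: two documents under topic 0 and one each under topics 1 and 2.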
 
 
 
topic-term
mahout vectordump \
  -i hdfs://192.168.122.1:2014/user/zhaohj/mahout/topics/lda/topic-term \
  -o data/lda/topic-term \
  -d hdfs://192.168.122.1:2014/user/zhaohj/mahout/topics/docsvectors3/dictionary.file-0 \
  -dt sequencefile \
  -sort true -vs 5 -p true

 
 
 
References
http://mail-archives.apache.org/mod_mbox/mahout-user/201205.mbox/%3CCAG3i8Se1QobSPpw8ewgNkjVw_Zd_8crb6Z18_7G5Yqew1XRTAw@mail.gmail.com%3E 
 
 http://stackoverflow.com/questions/21318459/how-to-run-mahout-cvb-on-reuters-news-on-cloudera-vm-cdh4-5-as-lda-is-not-longer
