๐Ÿ‘ฉ๐Ÿปโ€๐Ÿ’ปStandford CS231N Deep Learning -3


2020/09/12, notes for my GitHub blog

3. Loss Functions and Optimization

Linear Classifier (Review)

  • W์˜ ๊ฐ ํ–‰์€ class์˜ classifier๋กœ ์ƒ๊ฐํ•˜๊ฑฐ๋‚˜ ๊ฐ class์˜ template (prototype)์œผ๋กœ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค
  • Wx+b ์—์„œ (W,b)์Œ์„ ๊ธฐ์–ตํ•˜๊ธฐ ๋ณด๋‹จ, x์˜ ์ฒซ col์— 1์„ ์ถ”๊ฐ€ํ•˜๊ณ , W์™€ b๋ฅผ ํ•œ matrix๋กœ ์ €์žฅํ•˜๋Š” ๊ฒƒ. ๊ทธ๋ž˜์„œ ๊ต์ˆ˜๋‹˜์ด ๋งจ๋‚  ์•ž์— 1์ด ์žˆ๋Š” col์„ ์ถ”๊ฐ€ํ•œ๊ฑฐ์˜€๋‹ค...
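
A minimal numpy sketch of the bias trick; the matrices are made-up illustrative values, not from the lecture:

	import numpy as np

	W = np.array([[0.2, -0.5],
	              [1.5, 1.3]])   # 2 classes, 2 features
	b = np.array([1.1, 3.2])     # per-class biases
	x = np.array([2.0, 1.0])     # one input example

	# prepend a constant 1 to x, and prepend b as the first column of W
	x_ext = np.concatenate([[1.0], x])
	W_ext = np.hstack([b[:, None], W])

	assert np.allclose(W.dot(x) + b, W_ext.dot(x_ext))  # same scores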

Loss Function

  • loss function = cost function = objective function
  • quantifies how good (or bad) the current model is
  • the smaller the loss, the better

Multiclass SVM (Support Vector Machine) Loss

  • the correct class's score should be bigger than every incorrect class's score by at least Δ
  • y denotes the correct class and s denotes the scores f(x, W); the per-example loss is L_i = Σ_{j≠y} max(0, s_j - s_y + Δ), summing over the incorrect classes j
	import numpy as np

	def L_i_vectorized(x, y, W):
		scores = W.dot(x)  # class scores via dot product
		margins = np.maximum(0, scores - scores[y] + 1)  # margins[y] should be 0 but comes out as 1
		margins[y] = 0  # so zero it out here
		loss_i = np.sum(margins)
		return loss_i
  • a worked loss computation appears after this list
  • if W is initialized with very small values, every score s comes out close to 0, so each example's loss is (number of classes - 1): every incorrect class contributes a margin of 1. Useful as a debugging sanity check
  • using the mean instead of the sum only changes the scale, so it does not matter
  • hinge loss
    - a loss with a threshold at 0
    - (min, max) = (0, ∞)
    - the squared hinge, max(0, ·)², has the same range, but differences between losses become much more extreme; use it when you want small score differences to turn into large loss differences
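
Here is the margin computation worked through for one example with three classes; the scores mimic the cat/car/frog slide from the lecture, but treat the numbers as illustrative:

	import numpy as np

	scores = np.array([3.2, 5.1, -1.7])  # class scores for one example
	y = 0                                # index of the correct class
	margins = np.maximum(0, scores - scores[y] + 1)  # [1.0, 2.9, 0.0]
	margins[y] = 0                       # correct class contributes nothing
	print(np.sum(margins))               # loss = 2.9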

Regularization

  • many different W can reach a loss of 0 (if W works, so does 2W)
  • appending a regularization penalty R(W) to the loss function lets us pick a better W among them

    - ๊ฐ๋งˆ๋Š” 1๋ณด๋‹ค ํฐ ์ƒ์ˆ˜ (์ฃผ๋กœ cross-validation์œผ๋กœ ๊ฒฐ์ •)
    - W๊ฐ’๋งŒ ๊ฐ€์ง€๊ณ  ๊ณ„์‚ฐํ•˜๊ธฐ ๋•Œ๋ฌธ์— data์™€๋Š” ์ „ํ˜€ ๋ฌด๊ด€ํ•˜๋‹ค
    - loss์— ์ด ๊ฐ’์ด ํฌํ•จ๋˜๋Š”๊ฑฐ๋ผ W์ค‘ ๊ฐ€์žฅ ์ž‘์€ W๋ฅผ ๊ตฌํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ค€๋‹ค
    - R(W)๋กœ๋Š” ๋‹ค์–‘ํ•œ ์‹์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.
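
A small sketch of how the penalty enters the total loss, assuming the common L2 form R(W) = Σ W²; lam and the shapes are made-up values:

	import numpy as np

	def l2_penalty(W):
		return np.sum(W * W)  # depends only on W, never on the data

	W = np.random.randn(10, 3073) * 0.0001  # e.g. CIFAR-10 shape with the bias trick
	lam = 0.1        # regularization strength (hypothetical value)
	data_loss = 2.9  # e.g. the SVM loss from the example above
	loss = data_loss + lam * l2_penalty(W)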

Softmax Classifier

  • along with SVM, one of the most widely used classifiers
  • a generalization of binary logistic regression to multiple classes
  • uses the cross-entropy loss L_i = -log(e^{s_y} / Σ_j e^{s_j}) (sketched after this list)
  • if the initial W is small so every s is close to 0, the loss comes out to log C (C is the number of classes)
  • as the regularization parameter λ grows, the weights shrink, so the score gaps shrink and the softmax probability of the correct class gets smaller
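
A minimal numpy sketch of the cross-entropy loss; the max-shift is a standard numerical-stability trick not mentioned above, and the last line reproduces the log C sanity check:

	import numpy as np

	def softmax_cross_entropy(scores, y):
		shifted = scores - np.max(scores)  # stability shift; the softmax is unchanged
		probs = np.exp(shifted) / np.sum(np.exp(shifted))
		return -np.log(probs[y])           # loss for one example with label y

	print(softmax_cross_entropy(np.zeros(10), 3))  # log(10) ≈ 2.3026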

Optimization

  • Gradient Descent
    - find the downhill direction of the loss and step that way
    - uses the derivative (the gradient)
		while True:
			weights_grad = evaluate_gradient(loss_fun, data, weights)
			weights += -step_size * weights_grad  # step downhill
  • Mini-batch Gradient Descent
    - computing the gradient with all the training data is too expensive
    - so split the data into minibatches
    - this works because the samples in the data are correlated with each other, so a small batch's gradient approximates the full gradient well
  • SGD (Stochastic Gradient Descent)
    - the minibatch idea taken down to a single sample
    - in practice, SGD usually refers to the minibatch version (the terms are used interchangeably)
    - minibatch sizes are usually powers of 2 (32, 64, 128, 256, ...)
		while True:
			data_batch = sample_training_data(data, 256)  # sample 256 examples
			weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
			weights += -step_size * weights_grad
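
To see the loop actually run, here is a toy least-squares version with the helper functions inlined; every name and number is illustrative, not from the lecture:

	import numpy as np

	np.random.seed(0)
	X = np.random.randn(1000, 3)         # 1000 samples, 3 features
	true_w = np.array([2.0, -1.0, 0.5])
	y = X.dot(true_w)                    # targets from a known linear model

	weights = np.zeros(3)
	step_size = 0.1
	for _ in range(500):
		idx = np.random.choice(len(X), 256)  # stand-in for sample_training_data
		Xb, yb = X[idx], y[idx]
		# gradient of the mean squared error on the minibatch
		weights_grad = 2 * Xb.T.dot(Xb.dot(weights) - yb) / len(idx)
		weights += -step_size * weights_grad
	print(weights)                       # converges toward [2.0, -1.0, 0.5]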

๊ผผ์ง€๋ฝ

Batch Normalization์— ๋Œ€ํ•œ ๋…ผ๋ฌธ์„ ์ฝ์œผ๋ฉด์„œ ๋”ฅ๋Ÿฌ๋‹๊ณผ ๊ด€๋ จ๋œ ๊ธฐ์ดˆ ์ง€์‹์„ ๋งŽ์ด ์ฐพ์•„๋ดค๋‹ค. ๊ทผ๋ฐ ๊ทธ ๋‚ด์šฉ์„ ๋‹ค ์ด ๊ฐ•์˜์—์„œ ๋‹ค์‹œ ๋“ฃ๊ฒŒ ๋˜์„œ ์ดํ•ดํ•˜๊ธฐ๊ฐ€ ์กฐ๊ธˆ ์‰ฌ์›Œ์„œ ์ข‹๊ธฐ๋„ ํ•˜๋ฉด์„œ... ๋„ˆ๋ฌด ์•„๋Š”๊ฑฐ ์—†์ด ๋…ผ๋ฌธ์— ๋ค๋ณ๋‚˜? ํ•˜๋Š” ์ƒ๊ฐ๋„ ๋“ค์—ˆ๋‹ค. ๊ทธ๋ž˜๋„ ๊ฐ•์˜๋Š” ์žฌ๋ฐŒ์—ˆ๋‹ค :)

์ข‹์€ ์›นํŽ˜์ด์ง€ ์ฆ๊ฒจ์ฐพ๊ธฐ