Spark MLlib: Logistic Regression on the Iris Dataset

    Data download: http://download.csdn.net/download/dr_guo/9946656
    Environment: Spark 1.6.1; Scala 2.10.4; JDK 1.7

See the code comments for details.
package com.beagledata.test

import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD,LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.log4j.{Level,Logger}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}  
import org.apache.spark.mllib.regression.LabeledPoint


object IrisLogisticModelTest extends App{

  val conf = new SparkConf().setAppName("IrisLogisticModelTest")
  .setMaster("local")

  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // load data  
  val rddIris = sc.textFile("data/IrisData2.txt")
  //rddIris.foreach(println)

  case class Iris(a:Double, b:Double, c:Double, d:Double, target:Double)

  // The LabeledPoint label (target) must be a Double starting from 0.0;
  // for binary classification the two labels must be 0.0 and 1.0
  val dfIris = rddIris.map(_.split(",")).map(l => Iris(l(0).toDouble,l(1).toDouble,l(2).toDouble,l(3).toDouble,l(4).toDouble)).toDF()

  dfIris.registerTempTable("Iris")  

  //sqlContext.sql("""SELECT * FROM Iris""").show

  // Map feature names to indices
  val featInd = List("a", "b", "c", "d").map(dfIris.columns.indexOf(_))

  // Get index of target
  val targetInd = dfIris.columns.indexOf("target") 

  val labeledPointIris = dfIris.rdd.map(r => LabeledPoint(
   r.getDouble(targetInd), // Get target value
   // Map feature indices to values
   Vectors.dense(featInd.map(r.getDouble(_)).toArray)))

  // Split data into training (80%) and test (20%).
  val splits = labeledPointIris.randomSplit(Array(0.8, 0.2), seed = 11L)
  val trainingData = splits(0)
  val testData = splits(1)
  /*println("trainingData--------------------------->")
  trainingData.take(5).foreach(println)
  println("testData------------------------------->")
  testData.take(5).foreach(println)*/


  /*// Run training algorithm to build the model  
  val lr = new LogisticRegressionWithSGD().setIntercept(true)
  lr.optimizer
    .setStepSize(10.0)
    .setRegParam(0.0)
    .setNumIterations(20)
    .setConvergenceTol(0.0005)
  val model = lr.run(trainingData)*/
  val numClasses = 2
  //val model = LogisticRegressionWithSGD.train(trainingData, numIterations = 2)
  val model = new LogisticRegressionWithLBFGS().setNumClasses(numClasses).run(trainingData)

  // Compute predictions on the test set
  val labelAndPreds = testData.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
  }
  println("labelAndPreds------------------------->")
  labelAndPreds.take(5).foreach(println)
  // Evaluate the model with multiclass metrics
  val metrics = new MulticlassMetrics(labelAndPreds)
  val precision = metrics.precision
  println("Precision = " + precision)

}
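As a quick sanity check on what the final metric means: `MulticlassMetrics.precision` with no label argument is the overall fraction of correct predictions. A minimal plain-Scala sketch over hypothetical (label, prediction) pairs (the real pairs come from the model above):

```scala
// Hypothetical (label, prediction) pairs standing in for labelAndPreds
val labelAndPreds = Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0))

// Fraction of pairs where the predicted label matches the true label
val accuracy = labelAndPreds.count { case (l, p) => l == p }.toDouble / labelAndPreds.size
println("Precision = " + accuracy)  // 3 of 4 correct -> 0.75
```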
Errors encountered:

1. Input validation failed

Inspection showed that the labels in the dataset did not start from 0. For binary classification, the LabeledPoint label must be of type Double and start from 0.0: the two labels must be 0.0 and 1.0, not 1.0 and 2.0.
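If the raw file carries labels that start at 1.0, as it did here, one quick fix is to shift them down before building the LabeledPoints. A plain-Scala sketch of the idea, using hypothetical label values:

```scala
// Hypothetical raw labels as read from the file: 1.0 and 2.0
val rawLabels = Seq(1.0, 2.0, 1.0, 2.0)

// Shift so the smallest label becomes 0.0, which is what
// LogisticRegressionWithLBFGS expects for its class labels
val minLabel = rawLabels.min
val shifted = rawLabels.map(_ - minLabel)
println(shifted.mkString(","))  // 0.0,1.0,0.0,1.0
```

In the actual pipeline the same shift would be applied to the `target` column (or to `point.label`) before training.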
2. bad symbolic reference. A signature in GeneralizedLinearAlgorithm.class refers to term internal
in package org.apache.spark which is not available. It may be completely missing from the current classpath,
or the version on the classpath might be incompatible with the version used when compiling GeneralizedLinearAlgorithm.class.

This was caused by a duplicate spark-mllib jar: spark-assembly-1.6.1-hadoop2.6.0.jar already bundles spark-mllib, so having a separate spark-mllib jar on the classpath created a conflict. Removing the duplicate jar resolved the error.
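One way to avoid this kind of duplicate-jar conflict, if the project is built with sbt (an assumption; the post does not say how it is built), is to mark the Spark dependencies as `provided` so they are not packaged again alongside the assembly jar. A minimal build.sbt sketch:

```scala
// build.sbt -- hypothetical sketch; version numbers taken from the
// environment listed at the top of the post.
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // "provided" keeps these jars out of the packaged application, so they
  // do not clash with the MLlib classes already bundled inside
  // spark-assembly-1.6.1-hadoop2.6.0.jar on the runtime classpath.
  "org.apache.spark" %% "spark-core"  % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.6.1" % "provided"
)
```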
