Spark Study Notes: (4) MLlib Basics

MLlib: Machine Learning Library. Its main contents are as follows:
  • Data types
  • Statistical tools
      • summary statistics
      • correlations
      • stratified sampling
      • hypothesis testing
      • random data generation

  • Classification and regression
      • linear models (SVMs, logistic regression, linear regression)
      • naive Bayes
      • decision trees
      • ensembles of trees
      • isotonic regression

  • Collaborative filtering
      • ALS (alternating least squares)

  • Clustering
      • k-means
      • Gaussian mixture models
      • power iteration clustering (PIC)
      • LDA (latent Dirichlet allocation)
      • streaming k-means

  • Dimensionality reduction
      • SVD
      • PCA

  • Feature extraction and transformation
  • Frequent pattern mining
      • FP-growth

  • Optimization
      • stochastic gradient descent
      • limited-memory BFGS (L-BFGS)


    I. Data Types
    MLlib's data types are mainly local vectors and local matrices; the underlying linear-algebra operations are provided by Breeze and jblas.
    1. A local vector has integer-typed, 0-based indices and double-typed values, and comes in two kinds: dense and sparse.
    The base implementations of a local vector are DenseVector and SparseVector.
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Create a dense vector (1.0, 0.0, 3.0).
    val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
    // Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
    val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
    // Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
    val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))

    Scala imports scala.collection.immutable.Vector by default, so you have to import org.apache.spark.mllib.linalg.Vector explicitly to use MLlib's Vector.
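
    As a quick cross-check (a supplementary sketch, not in the original note), the dense and sparse vectors above encode the same data (1.0, 0.0, 3.0), so MLlib's distance and norm helpers treat them identically:
    // dv and sv1 represent the same vector, so their squared Euclidean distance is 0.0.
    println(Vectors.sqdist(dv, sv1)) // 0.0
    // The p-norm helper gives the same result for both representations.
    println(Vectors.norm(dv, 2.0) == Vectors.norm(sv1, 2.0)) // true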
  • In MLlib, a training example for supervised learning is called a "labeled point".
  • import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Create a labeled point with a positive label and a dense feature vector.
    val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

    // Create a labeled point with a negative label and a sparse feature vector.
    val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
  • MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR.
  • import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD

    val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
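
    For reference (the sample lines below are hypothetical, not taken from the data file above), each line of a LIBSVM file encodes one sparse example as "label index1:value1 index2:value2 ...", where the indices are one-based and ascending; loadLibSVMFile converts them to zero-based vector indices:
    0 1:0.5 3:1.5
    1 2:2.0 4:0.3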

    2. Local matrix
    A local matrix has integer-typed row and column indices and double-typed values, stored on a single machine. MLlib supports dense matrices, whose entry values are stored in a single double array in column-major order.
    import org.apache.spark.mllib.linalg.{Matrix, Matrices}

    // Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)).
    val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
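
    As a supplement (sparse local matrices are not mentioned in the note above, but MLlib provides them), Matrices.sparse builds a local matrix in compressed sparse column (CSC) format; a minimal sketch:
    // Create a 3x2 sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)) in CSC form.
    // colPtrs Array(0, 1, 3): column 0 holds 1 stored entry, column 1 holds 2.
    // rowIndices/values list the row and value of each stored entry, column by column.
    val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9.0, 6.0, 8.0))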
  • A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. A RowMatrix is a row-oriented distributed matrix without meaningful row indices, e.g., a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. We assume that the number of columns is not huge. An IndexedRowMatrix is similar to a RowMatrix but with row indices, which can be used for identifying rows and executing joins. A CoordinateMatrix is a distributed matrix stored in coordinate list (COO) format, backed by an RDD of its entries.
  • import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    val rows: RDD[Vector] = ... // an RDD of local vectors
    // Create a RowMatrix from an RDD[Vector].
    val mat: RowMatrix = new RowMatrix(rows)

    // Get its size.
    val m = mat.numRows()
    val n = mat.numCols()
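
    Beyond its size, a RowMatrix can also report per-column statistics; a short supplementary sketch (not in the original note) using computeColumnSummaryStatistics:
    // Column-wise summary statistics of mat.
    val summary = mat.computeColumnSummaryStatistics()
    println(summary.mean)        // mean of each column
    println(summary.variance)    // variance of each column
    println(summary.numNonzeros) // number of nonzeros in each column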
    
    
    
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

    val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
    // Create an IndexedRowMatrix from an RDD[IndexedRow].
    val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
    // Drop its row indices.
    val rowMat: RowMatrix = mat.toRowMatrix()
    
    
    
    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
    // Create a CoordinateMatrix from an RDD[MatrixEntry].
    val mat: CoordinateMatrix = new CoordinateMatrix(entries)
    // Convert it to an IndexedRowMatrix whose rows are sparse vectors.
    val indexedRowMatrix = mat.toIndexedRowMatrix()

     
  • BlockMatrix is a distributed matrix backed by an RDD of MatrixBlocks, where a MatrixBlock is a tuple of ((Int, Int), Matrix): the (Int, Int) is the index of the block, and Matrix is the sub-matrix at the given index. BlockMatrix supports methods such as add and multiply with another BlockMatrix. A BlockMatrix can be most easily created from an IndexedRowMatrix or CoordinateMatrix by calling toBlockMatrix.
  • import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
    import org.apache.spark.rdd.RDD

    val entries: RDD[MatrixEntry] = ... // an RDD of (i, j, v) matrix entries
    // Create a CoordinateMatrix from an RDD[MatrixEntry].
    val coordMat: CoordinateMatrix = new CoordinateMatrix(entries)
    // Transform the CoordinateMatrix to a BlockMatrix.
    val matA: BlockMatrix = coordMat.toBlockMatrix().cache()

    // Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
    // Nothing happens if it is valid.
    matA.validate()

    // Calculate A^T A.
    val ata = matA.transpose.multiply(matA)
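
    The add method mentioned above can be exercised the same way (a supplementary one-liner reusing matA; both operands must have matching block sizes):
    // Element-wise sum of two BlockMatrices with the same block structure.
    val matSum: BlockMatrix = matA.add(matA)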

     
    Before deciding which model to use, read its description first in the API doc: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
