Spark Study Notes: (4) MLlib Basics

MLlib: Machine Learning Library. Its main contents are as follows:
  • Data types
  • Statistical tools
      • summary statistics
      • correlations
      • stratified sampling
      • hypothesis testing
      • random data generation

  • Classification and regression
      • linear models (SVMs, logistic regression, linear regression)
      • naive Bayes
      • decision trees
      • ensembles of trees
      • isotonic regression

  • Collaborative filtering
      • ALS (alternating least squares)

  • Clustering
      • k-means
      • Gaussian mixture models
      • power iteration clustering (PIC)
      • LDA (latent Dirichlet allocation)
      • streaming k-means

  • Dimensionality reduction
      • SVD
      • PCA

  • Feature extraction and transformation
  • Frequent pattern mining
      • FP-growth

  • Optimization
      • stochastic gradient descent
      • limited-memory BFGS (L-BFGS)


    I. Data Types
    MLlib's data types are mainly local vectors and local matrices; the underlying linear-algebra operations are provided by Breeze and jblas.
    1. A local vector has integer-typed, 0-based indices and double-typed values, and comes in two kinds: dense and sparse.
    The base implementations of a local vector are DenseVector and SparseVector.
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Create a dense vector (1.0, 0.0, 3.0).
    val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
    // Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
    val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
    // Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
    val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))

    Scala imports scala.collection.immutable.Vector by default, so you have to import org.apache.spark.mllib.linalg.Vector explicitly to use MLlib's Vector.
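
    As a quick cross-check (a supplementary sketch, not in the original note), the dense and sparse vectors above encode the same data (1.0, 0.0, 3.0), so MLlib's distance and norm helpers treat them identically:
    // dv and sv1 represent the same vector, so their squared Euclidean distance is 0.0.
    println(Vectors.sqdist(dv, sv1)) // 0.0
    // The p-norm helper gives the same result for both representations.
    println(Vectors.norm(dv, 2.0) == Vectors.norm(sv1, 2.0)) // true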
  • In MLlib, a training example for supervised learning is called a "labeled point".
  • import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Create a labeled point with a positive label and a dense feature vector.
    val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

    // Create a labeled point with a negative label and a sparse feature vector.
    val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
  • MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR.
  • import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD

    val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
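
    For reference (the sample lines below are hypothetical, not taken from the data file above), each line of a LIBSVM file encodes one sparse example as "label index1:value1 index2:value2 ...", where the indices are one-based and ascending; loadLibSVMFile converts them to zero-based vector indices:
    0 1:0.5 3:1.5
    1 2:2.0 4:0.3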

    2. Local matrix
    A local matrix has integer-typed row and column indices and double-typed values, stored on a single machine. MLlib supports dense matrices, whose entry values are stored in a single double array in column-major order.
    import org.apache.spark.mllib.linalg.{Matrix, Matrices}

    // Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)).
    val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
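
    As a supplement (sparse local matrices are not mentioned in the note above, but MLlib provides them), Matrices.sparse builds a local matrix in compressed sparse column (CSC) format; a minimal sketch:
    // Create a 3x2 sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)) in CSC form.
    // colPtrs Array(0, 1, 3): column 0 holds 1 stored entry, column 1 holds 2.
    // rowIndices/values list the row and value of each stored entry, column by column.
    val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9.0, 6.0, 8.0))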
  • A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. A RowMatrix is a row-oriented distributed matrix without meaningful row indices, e.g., a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. We assume that the number of columns is not huge. An IndexedRowMatrix is similar to a RowMatrix but with row indices, which can be used for identifying rows and executing joins. A CoordinateMatrix is a distributed matrix stored in coordinate list (COO) format, backed by an RDD of its entries.
  • import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    val rows: RDD[Vector] = ... // an RDD of local vectors
    // Create a RowMatrix from an RDD[Vector].
    val mat: RowMatrix = new RowMatrix(rows)

    // Get its size.
    val m = mat.numRows()
    val n = mat.numCols()
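
    Beyond its size, a RowMatrix can also report per-column statistics; a short supplementary sketch (not in the original note) using computeColumnSummaryStatistics:
    // Column-wise summary statistics of mat.
    val summary = mat.computeColumnSummaryStatistics()
    println(summary.mean)        // mean of each column
    println(summary.variance)    // variance of each column
    println(summary.numNonzeros) // number of nonzeros in each column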
    
    
    
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

    val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
    // Create an IndexedRowMatrix from an RDD[IndexedRow].
    val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
    // Drop its row indices.
    val rowMat: RowMatrix = mat.toRowMatrix()
    
    
    
    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
    // Create a CoordinateMatrix from an RDD[MatrixEntry].
    val mat: CoordinateMatrix = new CoordinateMatrix(entries)
    // Convert it to an IndexedRowMatrix whose rows are sparse vectors.
    val indexedRowMatrix = mat.toIndexedRowMatrix()

     
  • BlockMatrix is a distributed matrix backed by an RDD of MatrixBlocks, where a MatrixBlock is a tuple of ((Int, Int), Matrix): the (Int, Int) is the index of the block, and Matrix is the sub-matrix at the given index. BlockMatrix supports methods such as add and multiply with another BlockMatrix. A BlockMatrix can be most easily created from an IndexedRowMatrix or CoordinateMatrix by calling toBlockMatrix.
  • import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
    import org.apache.spark.rdd.RDD

    val entries: RDD[MatrixEntry] = ... // an RDD of (i, j, v) matrix entries
    // Create a CoordinateMatrix from an RDD[MatrixEntry].
    val coordMat: CoordinateMatrix = new CoordinateMatrix(entries)
    // Transform the CoordinateMatrix to a BlockMatrix.
    val matA: BlockMatrix = coordMat.toBlockMatrix().cache()

    // Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
    // Nothing happens if it is valid.
    matA.validate()

    // Calculate A^T A.
    val ata = matA.transpose.multiply(matA)
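
    The add method mentioned above can be exercised the same way (a supplementary one-liner reusing matA; both operands must have matching block sizes):
    // Element-wise sum of two BlockMatrices with the same block structure.
    val matSum: BlockMatrix = matA.add(matA)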

     
    Before deciding which model to use, read its description first in the API doc: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
