Installing Spark


1. Overview

1) Features

  • Open-source distributed processing system built on the Hadoop ecosystem
  • Batch processing similar to MapReduce
  • Real-time data processing (Spark Streaming)
  • SQL-like processing of structured data (Spark SQL; see the short example after this list)
  • Machine learning algorithms (Spark MLlib)
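As a quick illustration of the structured-processing component, the sketch below builds a small DataFrame and queries it with Spark SQL. This is a minimal local example; the master URL and data are placeholders, not this cluster's configuration:

from pyspark.sql import SparkSession

# Local session purely for illustration; on a cluster the master URL differs
spark = SparkSession.builder.master("local[*]").appName("overview-demo").getOrCreate()

df = spark.createDataFrame([("alice", 3), ("bob", 5)], ["name", "cnt"])
df.createOrReplaceTempView("events")

# SQL-like structured data processing (Spark SQL)
spark.sql("SELECT name, cnt FROM events WHERE cnt > 3").show()

spark.stop()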

2) Advantages

  • Caches data in memory, which makes it fast
  • Efficiently handles iterative algorithms that reuse data, a weak point of Hadoop
  • Hadoop MapReduce needs three classes (main, mapper, reducer), whereas Spark can do the same job in a few lines of code (see the sketch after this list)
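To make the last point concrete: the classic word count, which needs a driver, a mapper, and a reducer class in MapReduce, fits in a few lines of PySpark. A minimal sketch (the HDFS paths are hypothetical placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

# hdfs:///input/sample.txt and hdfs:///output/wordcount are placeholder paths
counts = (sc.textFile("hdfs:///input/sample.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///output/wordcount")
spark.stop()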

3) Disadvantages

  • For small datasets, running on Spark can actually hurt performance, since job scheduling and distribution overhead outweigh the gains

4) How to run

  • Supported languages: Scala, Java, Python, R
  • Interactive Spark shell provided
  • Runs on a Spark standalone cluster or a Hadoop YARN cluster
  • To run on YARN, Spark only needs to be installed on the master, but standalone mode requires installing/deploying Spark to every slave (the snippet below shows how the master URL selects the mode)
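The choice between the two cluster managers comes down to the master URL (the --master option of spark-submit/pyspark, or .master() in code). A minimal sketch, assuming the standalone master on aidw-001 configured later in this guide:

from pyspark.sql import SparkSession

# Standalone cluster: point at the master's spark:// URL
spark = SparkSession.builder.master("spark://aidw-001:7077").appName("demo").getOrCreate()

# For YARN instead, use .master("yarn"); this requires HADOOP_CONF_DIR
# to point at the Hadoop configuration directory (set below in spark-env.sh)
spark.stop()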

2. Installation

  • We will run pyspark on top of YARN.

1) Download

wget https://downloads.apache.org/spark/spark-3.0.2/spark-3.0.2-bin-hadoop3.2.tgz
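After downloading, unpack the archive and move it so the path matches the SPARK_HOME used below, e.g. tar -xzf spark-3.0.2-bin-hadoop3.2.tgz followed by mv spark-3.0.2-bin-hadoop3.2 /opt/spark-3.0.2 (assuming /opt as the install root).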

2) Add environment variables

export SPARK_HOME="/opt/spark-3.0.2"
export PYSPARK="/opt/anaconda3"
export PYSPARK_PYTHON="/opt/anaconda3/bin/python"

export PATH=$PATH:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin

## Configure the pyspark command to launch Jupyter Notebook
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --allow-root"
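Reload the profile after editing (e.g. source ~/.bashrc, assuming that is where these exports live). With PYSPARK_DRIVER_PYTHON set to jupyter, the pyspark command starts a Jupyter Notebook server instead of the plain interactive shell.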

3) Edit configuration files

  • conf/spark-env.sh
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
  • conf/spark-defaults.conf
spark.master spark://aidw-001:7077
spark.driver.memory 30g
spark.driver.maxResultSize 4g
spark.executor.memory 30g
spark.executor.cores 2
spark.executor.heartbeatInterval 1000s
spark.network.timeout 2000000s

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///sparklog
spark.history.fs.logDirectory    hdfs:///sparklog
spark.history.provider           org.apache.spark.deploy.history.FsHistoryProvider

spark.sql.warehouse.dir /user/hive/warehouse
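Note that the event-log directory must exist on HDFS before any application writes to it, and the history server is started separately: hdfs dfs -mkdir -p /sparklog, then $SPARK_HOME/sbin/start-history-server.sh.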

4) conf/log4j.properties (copy conf/log4j.properties.template if the file does not exist yet)

log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN

# Settings to quiet third party logs that are too verbose
log4j.logger.org.sparkproject.jetty=WARN
log4j.logger.org.sparkproject.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

5) conf/slaves

aidw-001
aidw-002
aidw-003
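Because this is standalone mode, the Spark directory (including conf/) must exist at the same path on every host listed here; copying /opt/spark-3.0.2 to each slave with scp or rsync before starting the cluster is the simplest way to do that.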

6) Start the cluster

root@aidw-001:/opt/spark-3.0.2/sbin# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.master.Master-1-aidw-001.out
aidw-002: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-002.out
aidw-001: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-001.out
aidw-003: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-003.out
aidw-005: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-005.out
aidw-004: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-004.out

## Confirm that the Master and Worker processes are running
root@aidw-001:/opt/spark-3.0.2/sbin# jps
137714 Master
134770 JournalNode
114496 QuorumPeerMain
135334 NodeManager
134523 DataNode
135163 ResourceManager
134360 NameNode
137961 Jps
137881 Worker
135005 DFSZKFailoverController

root@aidw-005:/usr/lib/jvm/java-8-openjdk-amd64# jps
3206993 NodeManager
3223091 Worker
3223909 Jps
3223784 Main
3206814 DataNode

7) Install the pyspark packages

conda install pyspark
conda install -c conda-forge findspark
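findspark makes the pyspark package importable from an ordinary Python interpreter or notebook kernel by locating the Spark install. A quick check, assuming SPARK_HOME is set as above:

import findspark
findspark.init()  # uses $SPARK_HOME to put pyspark on sys.path

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("findspark-check").getOrCreate()
print(spark.version)  # 3.0.2 for this install
spark.stop()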

8) Configure jupyter notebook

jupyter notebook --generate-config
vi ~/.jupyter/jupyter_notebook_config.py

Edit line 266 (the notebook working directory):
c.NotebookApp.notebook_dir = '/home/user/work'
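If the notebook needs to be reachable from other machines, as in the URLs below, also set c.NotebookApp.ip = '0.0.0.0' in the same file; by default it only listens on localhost.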

9) Run pyspark

root@aidw-001:/opt/spark-3.0.2/sbin# pyspark
[I 16:45:36.970 NotebookApp] Serving notebooks from local directory: /home/aidw/work
[I 16:45:36.970 NotebookApp] The Jupyter Notebook is running at:
[I 16:45:36.970 NotebookApp] http://aidw-001:8888/?token=5ac5bb4710694c7adabffc3004aabfa40639b6eac2ee0de3
[I 16:45:36.970 NotebookApp]  or http://127.0.0.1:8888/?token=5ac5bb4710694c7adabffc3004aabfa40639b6eac2ee0de3
[I 16:45:36.970 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 16:45:36.973 NotebookApp] No web browser found: could not locate runnable browser.
[C 16:45:36.974 NotebookApp]

    To access the notebook, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/nbserver-138626-open.html
    Or copy and paste one of these URLs:
        http://aidw-001:8888/?token=5ac5bb4710694c7adabffc3004aabfa40639b6eac2ee0de3
     or http://127.0.0.1:8888/?token=5ac5bb4710694c7adabffc3004aabfa40639b6eac2ee0de3
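Once the notebook is up, a quick sanity check in a new cell (pyspark pre-creates the spark and sc session objects, so nothing needs to be imported):

# `spark` and `sc` are created automatically by the pyspark launcher
print(spark.version)                      # e.g. 3.0.2
print(sc.parallelize(range(100)).sum())   # expect 4950
print(spark.range(10).count())            # expect 10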

Troubleshooting

1.

root@aidw-001:/opt/spark-3.0.2/sbin# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.master.Master-1-aidw-001.out
aidw-003: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-003.out
aidw-002: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-002.out
aidw-004: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-004.out
aidw-005: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-005.out
aidw-001: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-001.out
aidw-004: failed to launch: nice -n 0 /opt/spark-3.0.2/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://aidw-001:7077
aidw-004:   /opt/spark-3.0.2/bin/spark-class: line 71: /usr/lib/jvm/java-11-openjdk-amd64/bin/java: No such file or directory
aidw-004:   /opt/spark-3.0.2/bin/spark-class: line 96: CMD: bad array subscript
aidw-004: full log in /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-004.out
root@aidw-001:/opt/spark-3.0.2/sbin# ./stop-all.sh
aidw-001: stopping org.apache.spark.deploy.worker.Worker
aidw-002: stopping org.apache.spark.deploy.worker.Worker
aidw-004: no org.apache.spark.deploy.worker.Worker to stop
aidw-003: stopping org.apache.spark.deploy.worker.Worker
aidw-005: stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master
  • Resolved by installing Java 11 on the failing node: sudo apt install openjdk-11-jdk. The JAVA_HOME set in spark-env.sh (/usr/lib/jvm/java-11-openjdk-amd64) must exist on every worker, and it was missing on aidw-004.

