Spark Installation
1. Overview
1) Features
- Open-source distributed processing system that runs on the Hadoop ecosystem
- Batch processing similar to MapReduce
- Real-time (streaming) data processing (Spark Streaming)
- SQL-like processing of structured data (Spark SQL)
- Machine learning algorithms (Spark MLlib)
2) Advantages
- Caches data in memory, which greatly speeds up repeated access
- Handles iterative algorithms that reuse data efficiently, a weak point of Hadoop MapReduce
- Hadoop MapReduce needs three classes (main, mapper, reducer) even for a simple job, whereas the same job fits in a few lines of Spark code (see the sketch below).
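For reference, a minimal PySpark sketch of the classic word-count job, the example usually used to contrast with MapReduce's three classes. The HDFS paths are hypothetical placeholders.

# Word count in a few lines of PySpark (paths are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("hdfs:///tmp/input.txt")            # hypothetical input file
      .flatMap(lambda line: line.split())           # split each line into words
      .map(lambda word: (word, 1))                  # emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)              # sum the counts per word
)
counts.saveAsTextFile("hdfs:///tmp/wordcount-out")  # hypothetical output directory
spark.stop()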
3) Disadvantages
- For small datasets, the overhead of running on Spark can actually make things slower.
4) How to run
- Supported languages: Scala, Java, Python, R
- Interactive Spark shell provided
- Cluster managers: Spark standalone cluster, Hadoop YARN cluster
- To run on YARN, Spark only needs to be installed on the master; for standalone mode it has to be installed/deployed on every slave (see the example below).
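The cluster manager is chosen per application via the master URL. A hedged sketch in PySpark (the standalone URL uses the aidw-001:7077 master configured later in this guide; the same values can be passed to spark-submit with --master):

# Choosing the cluster manager when building the session.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")                  # or "spark://aidw-001:7077" for the standalone cluster
    .appName("cluster-manager-example")
    .getOrCreate()
)
print(spark.sparkContext.master)     # confirm which cluster manager is actually in use
spark.stop()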
2. Installation
- We will run pyspark on top of YARN.
1) Download
wget https://downloads.apache.org/spark/spark-3.0.2/spark-3.0.2-bin-hadoop3.2.tgz
2) Add environment variables
export SPARK_HOME="/opt/spark-3.0.2"
export PYSPARK="/opt/anaconda3"
export PYSPARK_PYTHON="/opt/anaconda3/bin/python"
export PATH=$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
## Make the pyspark launcher start a Jupyter notebook as the driver
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --allow-root"
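With these variables in place, pyspark should hand the driver over to Jupyter and use the Anaconda interpreter on the executors. A quick check from a notebook cell (expected values are the paths assumed above):

# Confirm the environment variables are visible to the driver process.
import os
import sys

print(sys.executable)                    # the notebook kernel's Python interpreter
print(os.environ.get("SPARK_HOME"))      # expected: /opt/spark-3.0.2
print(os.environ.get("PYSPARK_PYTHON"))  # expected: /opt/anaconda3/bin/python (used by executors)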
3) Edit configuration files
- conf/spark-env.sh
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
- conf/spark-defaults.conf
spark.master spark://aidw-001:7077
spark.driver.memory 30g
spark.driver.maxResultSize 4g
spark.executor.memory 30g
spark.executor.cores 2
spark.executor.heartbeatInterval 1000s
spark.network.timeout 2000000s
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///sparklog
spark.history.fs.logDirectory hdfs:///sparklog
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.sql.warehouse.dir /user/hive/warehouse
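Once a session is running, the values from spark-defaults.conf can be read back to confirm they were picked up; a small sketch using the property names set above:

# Read back a few properties configured in spark-defaults.conf.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("conf-check").getOrCreate()
conf = spark.sparkContext.getConf()

for key in ("spark.master", "spark.executor.memory", "spark.eventLog.dir"):
    print(key, "=", conf.get(key, "<not set>"))  # default avoids an error if a key is unset

spark.stop()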
4) log4j.properties
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
# Settings to quiet third party logs that are too verbose
log4j.logger.org.sparkproject.jetty=WARN
log4j.logger.org.sparkproject.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
5) slaves
aidw-001
aidw-002
aidw-003
6) Start the cluster
root@aidw-001:/opt/spark-3.0.2/sbin# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.master.Master-1-aidw-001.out
aidw-002: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-002.out
aidw-001: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-001.out
aidw-003: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-003.out
aidw-005: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-005.out
aidw-004: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-004.out
## Confirm that the Master and Worker processes are running
root@aidw-001:/opt/spark-3.0.2/sbin# jps
137714 Master
134770 JournalNode
114496 QuorumPeerMain
135334 NodeManager
134523 DataNode
135163 ResourceManager
134360 NameNode
137961 Jps
137881 Worker
135005 DFSZKFailoverController
root@aidw-005:/usr/lib/jvm/java-8-openjdk-amd64# jps
3206993 NodeManager
3223091 Worker
3223909 Jps
3223784 Main
3206814 DataNode
7) Install the pyspark packages
conda install pyspark
conda install -c conda-forge findspark
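findspark is useful when Jupyter or a plain Python interpreter is started directly (rather than through the pyspark launcher): it puts the Spark installation on sys.path before pyspark is imported. A sketch assuming the install path used in this guide:

# Make the /opt/spark-3.0.2 installation importable from a plain Python session.
import findspark
findspark.init("/opt/spark-3.0.2")  # omit the argument to fall back to $SPARK_HOME

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("findspark-example").getOrCreate()
print(spark.version)                # expected: 3.0.2
spark.stop()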
8) Configure Jupyter Notebook
jupyter notebook --generate-config
vi ~/.jupyter/jupyter_notebook_config.py
Edit line 266:
c.NotebookApp.notebook_dir = '/home/user/work'
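jupyter_notebook_config.py is itself a Python file, so other options commonly needed for a headless server live alongside this line; the values below are illustrative additions, not part of the original setup:

# Illustrative additions to ~/.jupyter/jupyter_notebook_config.py
c.NotebookApp.ip = '0.0.0.0'        # listen on all interfaces for remote access
c.NotebookApp.port = 8888           # default port, as seen in the logs below
c.NotebookApp.open_browser = False  # the server host has no local browser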
9) Run pyspark
root@aidw-001:/opt/spark-3.0.2/sbin# pyspark
[I 16:45:36.970 NotebookApp] Serving notebooks from local directory: /home/aidw/work
[I 16:45:36.970 NotebookApp] The Jupyter Notebook is running at:
[I 16:45:36.970 NotebookApp] http://aidw-001:8888/?token=5ac5bb4710694c7adabffc3004aabfa40639b6eac2ee0de3
[I 16:45:36.970 NotebookApp] or http://127.0.0.1:8888/?token=5ac5bb4710694c7adabffc3004aabfa40639b6eac2ee0de3
[I 16:45:36.970 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 16:45:36.973 NotebookApp] No web browser found: could not locate runnable browser.
[C 16:45:36.974 NotebookApp]
To access the notebook, open this file in a browser:
file:///root/.local/share/jupyter/runtime/nbserver-138626-open.html
Or copy and paste one of these URLs:
http://aidw-001:8888/?token=5ac5bb4710694c7adabffc3004aabfa40639b6eac2ee0de3
or http://127.0.0.1:8888/?token=5ac5bb4710694c7adabffc3004aabfa40639b6eac2ee0de3
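After opening one of the URLs with the token, a first notebook cell can confirm the session works; the pyspark launcher normally pre-creates a spark session, and getOrCreate() returns it either way (the sample DataFrame is illustrative):

# First cell in the notebook started by `pyspark`.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version, spark.sparkContext.master)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])  # tiny sample data
df.show()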
Troubleshooting
1. Worker fails to launch on aidw-004 (Java 11 missing)
root@aidw-001:/opt/spark-3.0.2/sbin# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.master.Master-1-aidw-001.out
aidw-003: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-003.out
aidw-002: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-002.out
aidw-004: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-004.out
aidw-005: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-005.out
aidw-001: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-001.out
aidw-004: failed to launch: nice -n 0 /opt/spark-3.0.2/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://aidw-001:7077
aidw-004: /opt/spark-3.0.2/bin/spark-class: line 71: /usr/lib/jvm/java-11-openjdk-amd64/bin/java: No such file or directory
aidw-004: /opt/spark-3.0.2/bin/spark-class: line 96: CMD: bad array subscript
aidw-004: full log in /opt/spark-3.0.2/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-aidw-004.out
root@aidw-001:/opt/spark-3.0.2/sbin# ./stop-all.sh
aidw-001: stopping org.apache.spark.deploy.worker.Worker
aidw-002: stopping org.apache.spark.deploy.worker.Worker
aidw-004: no org.apache.spark.deploy.worker.Worker to stop
aidw-003: stopping org.apache.spark.deploy.worker.Worker
aidw-005: stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master
- Resolved by installing Java 11 on the failing node: sudo apt install openjdk-11-jdk